A Model of Information Nudges - Stanford University · 2016-04-26 · an implicit assumption that a...

A Model of Information Nudges

Lucas Co�man∗ Clayton R. Featherstone† Judd B. Kessler†

November 24, 2015

Abstract

A growing empirical literature has demonstrated that providing decision-makers

with information (e.g. about the actions of others or the returns to di�erent actions)

can a�ect behavior. However, the literature lacks a theory that can explain when such

interventions will have a large e�ect or even the sign of the e�ect. We introduce such

a theory, based on simple Bayesian updating in a setting of binary choice. It yields the

following intuitive insight: the sign of the e�ect depends on whether the intervention

causes the marginal agent to update her belief up or down. Further, the magnitude of

the e�ect depends on both the density of agents at the margin and how much those

agents' beliefs move when treated. We also show that when it is prohibitively costly or

impossible to directly measure the beliefs of marginal agents, we can proxy for these

beliefs with the fraction of agents taking the action in the uninformed group. Utilizing

this intuition, our model makes a strong prediction about how treatment e�ect sign and

magnitude will vary with the proportion taking the action in the control group. Our

model reasonably rationalizes results from the literature: we perform a meta-analysis

of informational nudges and �nd that, even across very di�erent experimental settings,

the magnitude of the treatment e�ect varies in a way our theory predicts (note: more

data to come soon).

∗Department of Economics, The Ohio State University†The Wharton School, University of Pennsylvania

1

1 Introduction

In recent years, a large and growing body of empirical work has investigated the role of

�nudges� on behavior.1 As this literature has blossomed, researchers have investigated the

e�cacy of a number of prominent nudges across a wide variety of settings. One surprise from

the broad application of these interventions is that some of the most empirically robust and

behaviorally intuitive nudges sometimes fail to in�uence behavior or to in�uence behavior in

the opposite direction than expected. These results are seen as surprising since many hold

an implicit assumption that a nudge that works in one context will work in all other contexts

rather than relying on theory to assess when a nudge is likely to be e�ective.

This paper introduces such a theory for one of the most popular nudges: providing

individuals with information about a choice they face. We develop a Bayesian updating

model that treats the information as a signal that leads individuals to update about the

utility they receive from one of the outcomes. This information could be social information

(e.g. information that the majority of other people donate to a charity) or information about

the costs or bene�ts of taking actions (e.g. information about the returns to graduating from

high school).2 Providing individuals with information has successfully changed behavior in

a wide variety of contexts.3 However, a number of prominent empirical papers have found

null results from providing information (see, e.g., Allcott and Taubinsky (2015), Avitabile

and De Hoyos Navarro (2015), Bettinger et al. (2012), Hastings, Neilson and Zimmerman

(2015), Slemrod, Blumenthal and Christian (2001)) or found treatment e�ects in the opposite

direction than expected for at least some groups (see, e.g., Fellner, Sausgruber and Traxler

(2013),Bhargava and Manoli (2013), Beshears et al. (2015a)).4 In the absence of a model,

1See Sunstein and Thaler (2008) for a detailed discussion of nudges. This work has been in�uential inthe policy domain, spawning nudge units in the U.K. (called the Behavioral Insights Team), U.S. (called theSocial and Behavioral Sciences Team), and around the world.

2See Vesterlund (2003) for a model of how sequential fund-raising can allow potential donors to provideinformation to one another about the quality of a charity. Our model is in this spirit and inspired byVesterlund (2003) but considers a general information structure.

3Information about others' decisions has a�ected decisions to donate money (see, e.g., Frey and Meier(2004), Martin and Randal (2008), Croson and Shang (2008), Shang and Croson (2009)), rate movies (Chenet al. (2010)), order certain entrées (Cai, Chen and Fang (2009)), save energy (Allcott (2011)), reuse tow-els (Goldstein, Cialdini and Griskevicius (2008)), pay taxes (Hallsworth et al. (2014)), like certain songs(Salganik, Dodds and Watts (2006)), steal petri�ed wood (Cialdini et al. (2006)), intend to vote (Ger-ber and Rogers (2009)), litter (Cialdini, Reno and Kallgren (1990)), take a job (Co�man, Featherstone andKessler (2014)), give money in a laboratory public goods games (Keser and Van Winden (2000), Fischbacher,Gächter and Fehr (2001), and Potters, Sefton and Vesterlund (2005)). Information about the costs or bene-�ts of di�erent actions has been shown to a�ect school choice (Hastings and Weinstein (2007)), standardizedtest scores (Nguyen (2008)), graduation rates (Jensen (2010)), claiming tax bene�ts (Bhargava and Manoli(2013)), tax compliance (Pomeranz (2015)), 401k contribution levels (Clark, Maki and Morrill (2014)), eat-ing fewer calories (Bollinger, Leslie and Sorensen (2011)), responding to energy price changes (Jessoe andRapson (2014)), and purchasing �uorescent light bulbs (Allcott and Taubinsky (2015)), among many others.

4While information interventions are the focus of this paper, nudges in other domains may also fail to

2

many papers attempt to rationalize their null or negative results by alluding to the potential

for negative belief updates � implying that the information treatment ended up leading

agents to update in the opposite direction than had been intended � an argument based on

the intuition that nudges are more likely to be e�ective when the information about the value

of the action is �good news� relative to the average belief (see, e.g., Schultz et al. (2007)).5

Our model focuses on binary decision problems and shows that with Bayesian updating

and very general assumptions, this standard intuition is wrong. Our key insight is that what

matters for whether the treatment e�ect of an information nudge is positive or negative

is how the information a�ects people at the margin. This distinction is important since it

reveals that an information nudge that is �good news� about the value of an action relative to

average beliefs or that is �good news� about the value of an action for the majority of people

need not increase the number of people who take the action. The results suggest that to

predict whether a nudge will be successful at changing behavior requires knowing something

about the distribution of beliefs of agents and being able to identify who is at the margin.

This �nding highlights the particular value of measuring the distribution of agent beliefs.

However, even if a researcher is unable to measure beliefs (e.g. because it is prohibitively

costly or otherwise unfeasible), our theory suggests it might still be possible to infer beliefs

at the margin. Embedded in the theory is a useful indicator of beliefs of those at the margin:

the take-up rate in the untreated group. We show that the percentage of people who take-up

the action in the untreated group suggests whether the prior beliefs of people at the margin

are more or less positive than the average belief in the population.6

The model therefore suggests a speci�c shape of the relationship between the take-up rate

in the untreated group (i.e., the �baseline�) and the sign and size of the treatment e�ect for

a particular nudge. In particular, in settings with di�erent baselines (and with information

nudges that aim to increase beliefs for a majority of the population, typically the goal of

such a study), as the baseline increases from 0: the treatment e�ect will decrease from 0,

then increase back to 0 (when the information provided in the nudge is identical to the belief

deliver expected outcomes and thus might need mode detailed models to help elucidate when we shouldexpect treatment e�ects. For example, nudges in choice architecture do not always deliver the expectedresults (see, e.g., Kessler and Roth (2015) on �yes-no� vs. �opt-in� choice frames for active choice organdonation requests of the type made at Departments of Motor Vehicles). See Beshears et al. (2015b) for ananalysis of which sub-populations will respond to defaults in 401(k) contribution levels.

5This argument is often made in settings without good information on beliefs of individuals and so it ishard to validate.

6To give some intuition for this result, in settings without information on beliefs (e.g. from a survey ofthe untreated), we assume that when the proportion taking the action in the untreated group is low (high),the marginal agents are those with higher (lower) beliefs about the value of taking the action. In this way,if the proportion choosing the action is low (high) in the untreated group, the treatment is more likely todecrease (increase) beliefs.

3

of the people a the margin), then continue to increase, and then decrease to 0 as the baseline

approaches 1.

We support our theory, and this pattern between baseline and treatment e�ects, with a

meta-analysis of informational nudges on binary outcomes in the literature. We �nd that even

across very di�erent experimental settings, the magnitude of the treatment e�ect is predicted

by take-up in the untreated group in the way our theory predicts. Consequently, our theory

rationalizes diverse �ndings from informational interventions present in the literature and

helps to answer the question about why some information nudges �nd null or negative results.

The meta-analysis, with the current data set (more data is currently being collected), comes

with the caveat that we are underpowered to test predictions about the sign of the treatment

e�ect.

We develop some extensions of the model, including extending it to environments when

agents make continuous choices. In those settings, rather than being concerned about people

at the margin, the treatment e�ect depends fundamentally on the people whose marginal

utility is more sensitive to changes in information. In settings of continuous choice, however,

the standard intuition is closer to accurate � the sign of the treatment e�ect depends on

whether information is �good news� for the average agent.

Our approach introduces a convenient methodology for dealing with Bayesian updating,

which often has tractability issues. For instance, Chambers and Healy (2012) show that for

any values of ν, µprior, and µpost, there exist a prior distribution f and an unbiased signal

distribution n (i.e.´ν · n (ν |x) · dν = x) such that upon observing a signal realization ν,

the subject updates her prior expectation from µprior to µpost. Surprisingly, while the theory

of how to compare the posteriors of two subjects who di�er only in the signals they receive

is well known (Milgrom 1981), little has been written about how to compare a subject's

prior to his posterior upon receiving a signal (Featherstone 2015). In this paper, we address

this by formalizing the idea of an information nudge as a signal which has little information

content. In this limit, a simple perturbation theory argument allows us to compute a closed

form expression for the posterior moments in terms of the prior moments. This approach

seems likely to be useful in applied theory more broadly.

While we are focusing on information interventions, other nudges may work partially

through information channels. For example, the speci�c choice architecture selected by

a policy maker could signal information to decision makers about what the policy maker

thinks is best. In addition, reminders, which are assumed to work through inattention (e.g.,

Taubinsky (2014)) have been shown to a�ect beliefs about the probability others take a

certain action (see, e.g., Del Carpio (2013)) and so might also work through an information

channel. To the extent that information is active, the main insights of our model would still

4

apply.

2 A model of information nudges

Our basic model of information nudges is simple. Each agent has a utility function u (x)

and a prior f (x), where x is a scalar outcome. Agents �take-up� the �behavior� if they

expect u (x) to be non-negative, and do not take-up otherwise; that is, utility is de�ned

net of the agent's best alternative. �Nudged� agents receive a common scalar signal of x,

denoted by ν, which is distributed according to the density n (ν |x). Agent heterogeneity

across the population in utility and prior will be summarized by a joint distribution g (· · · )over the moments of the prior (µ ≡ E [x], σ2 ≡ Var [x], etc.) and the derivatives of the utility

function, evaluated at the population mean of the prior expectation. Ultimately, we will be

concerned with two quantities aggregated across the population. The �rst is the fraction of

the population that takes-up without being nudged, which we will call the baseline take-up

rate and denote by β. The second is the di�erence between the fraction who take-up when

nudged and the fraction who take-up without being nudged, which we will call the treatment

e�ect and denote by τ .

The next six subsections �esh out this basic model. In the �rst, we introduce an intuitive

condition that justi�es our model assuming that utility is a function of x alone. In the second

subsection, we introduce a method for approximating the moments of the posterior from the

moments of the prior in the limit where the signal is an information nudge, which we will

formally de�ne in terms of the information content of the signal. In the third subsection, we

derive a condition for whether an agent will take-up in terms of the moments of his prior,

the derivatives of his conditional expected utility, and whether he was nudged. In the fourth

subsection, we will discuss suitable restrictions on how we model the population of agents

with the distribution g. In the �fth subsection, we derive a formula for the treatment e�ect.

Finally, in the sixth subsection, we discuss the assumptions made in the �rst �ve subsections

and discuss the extent to which they are dispensable, as well as what they mean for our

model.

2.1 Interpreting u (x)

Although the information is about x, it is not necessarily the case that x is all that an agent

cares about. In fact, it might even be that an agent doesn't care about x directly, but only

cares about the signal that x sends about other parameters. To capture these intuitions,

consider letting agents have a utility function u (x, y) that is a function of both x as well as

5

a vector of indirectly signaled arguments y. We de�ne the expected utility conditional on x

as

u (x) ≡ E [u (x, y) |x] .

We usually think of information as changing a subject's beliefs, but not her utility function.

To ensure that u (x) doesn't change when the signal is received, we assume that the signal

is independent of y conditional on the true value of the x, that is, ν ⊥ y | x.

Proposition 1. If ν ⊥ y | x, then

• To get the marginal posterior over x, we can simply apply Bayes' rule directly to the

marginal prior over x, that is, f (x | ν) = f(x)·n(ν |x)´f(x)·n(ν | x)·dx .

• Expected utility conditional on x and ν is the same as expected utility conditional on

x, that is, E [u (x, y) |x] = E [u (x, y) |x, ν].

See appendix for formal proof. Intuitively, this conditional independence assumption

rules out the possibility that the signal conveys independent information about both x and

y. Put di�erently, ν can convey information about y � it just can't convey information

about y to a subject that already knows x. When we interpret ν as a noisy signal of x,

this is particularly sensible. So, utility being a function of x alone is justi�ed so long as we

interpret it as the expected utility conditional on x.

2.2 How beliefs respond to an information nudge

Bayes' rule dictates that a nudged agent should have a posterior belief given by

f (x | ν) =f (x) · n (ν |x)´f (x) · n (ν | x) · dx

.

Unfortunately, this doesn't tell us much without further restrictions. For instance, Cham-

bers and Healy (2012) show that for any values of ν, µprior, and µpost, there exist a prior

distribution f and an unbiased signal distribution n (i.e.´ν · n (ν |x) · dν = x) such that

upon observing a signal realization ν, the subject updates her prior from µprior to µpost. Sur-

prisingly, while the theory of how to compare the posteriors of two subjects who di�er only

in the signals they receive is well known (Milgrom 1981), little has been written about how

to compare a subject's prior to his posterior upon receiving a signal (Featherstone 2015).

To gain a foothold on this problem, previous literature has tended to make speci�c

functional form assumptions (cf. Morris and Shin 2002; Angeletos and Pavan 2007; Cornand

and Heinemann 2008). Instead of going this route, we will make two alternative assumptions.

6

The �rst formalizes what it means for ν to be a signal of x: given any realization ν, that

realization should be a maximum likelihood estimator of x, given a uniform prior. This

maximum likelihood assumption means that ν = arg maxx

n (ν | x), which requires that

n2 (ν | ν) = 0 and n22 (ν | ν) < 0. The second assumption formalizes the de�nition of a

nudge: it should only induce a slight di�erence between posterior and prior. Intuitively, this

happens when the signal is only weakly informative, that is, when the distribution of the

signal only weakly changes with x.7 Mathematically, consider the asymptotic as ε → 0 of

a family of signal distributions based on the original: nε,ν0 (ν |x) ≡ n (ν | ν0 + ε · (x− ν0)).

The parameter ν0 serves as a way to keep variation in the signal distinct from variation in

the center about which a Taylor expansion will ultimately be made. Clearly, if take the

ε→ 0 limit and then substitute ν0 = ν, the posterior approaches the prior, since n0,ν (ν |x)

remains constant as x varies. The information nudge approximation is to compute the

ε → 0 asymptotic, substitute ν0 = ν, and then truncate to leading order in ε. Using this

approach, the signal distribution becomes8

[nε,ν0 (ν |x)]ε→0|ν0=ν ≈[n (ν | ν0) + n2 (ν | ν0) · (x− ν0) · ε+

1

2· n22 (ν | ν0) · (x− ν0)2 · ε2

]ν0=ν

,

= n (ν | ν) ·{

1 +1

2· n22 (ν | ν)

n (ν | ν)· (x− ν)2 · ε2

}.

Further, if we re-scale ε by a factor of√−n22(ν | ν)

n(ν | ν), the expression in the curly brackets

becomes proportional to 1 − 12· ε2 · (x− ν)2 , which is exactly what we would have gotten

had we started by assuming that n (ν |x) was the normal density with mean x and very

large variance. In other words, to leading order, nudges look normal. Going forward, we

7Although we base our formalization of nudge on the latter intuition, it also makes the former moreprecise: if we use the nudge approximation, straightforward computation shows that the Fisher informationof the signal becomes small, that is, I (x) ≡ E

[∂∂x {log n (ν |x)}

∣∣x] ∼ O (ε4).

8Note that the integral of this signal distribution with respect to t is not equal to one. This demonstrates asubtlety to our approach: the ν0 = ν substitution should not be taken until after any operation that treats ν

as a variable independent of ν0. For instance, instead of simply´n (ν | ν)·

{1 + 1

2 ·n22(ν | ν)n(ν | ν) · (x− ν)

2 · ε2}·dν,

the integral of the signal distribution should �rst be computed leaving ν0 in place, since the integral is overν and not ν0:

ˆ [n (ν | ν0) + n2 (ν | ν0) · (x− ν0) · ε+

1

2· n22 (ν | ν0) · (x− ν0)

2 · ε2

]· dν

=

{1 +

ˆn2 (ν | ν0) · dν · (x− ν0) · ε+

ˆn22 (ν | ν0) · dν · 1

2· (x− ν0)

2 · ε2

}= 1 +

∂

∂ν0

[ˆn (ν | ν0) · dν

]· (x− ν0) · ε+

∂2

∂ν20

[ˆn (ν | ν0) · dν

]· 1

2· (x− ν0)

2 · ε2 = 1.

7

will re-scale ε so that n (ν |x) ∝ 1 − 12· ε2 · (x− ν)2.9 This approximation will allow us to

make signi�cant headway against the intractability of the general Bayesian update formula.

Speci�cally, now we can calculate any moment of the posterior. Let µ ≡ E [x] and σ2 ≡E[(x− µ)2] represent the mean and variance of the prior, and further let γ1 ≡

E[(x−µ)3]σ3 and

γ2 ≡E[(x−µ)4]

σ4 − 3 represent the prior's skewness and excess kurtosis.10 Then,

Theorem 1. The �rst two moments of the posterior, to leading order in ε2 are

E [x | ν] = µ+ σ2 ·{

(ν − µ)− 1

2· σ · γ1

}· ε2,

Var [x | ν] = σ2 −{σ4 ·

(1 +

1

2· γ2

)− σ3 · γ1 · (ν − µ)

}· ε2.

If γ1 and γ2 are O (ε2) small, then these moments reduce to

E [x | ν] = µ+ σ2 · (ν − µ) · ε2,

Var [x | ν] = σ2 − σ4 · ε2.

Theorem 11 in the appendix provides a general formula for any moment of the posterior,

but for the purposes of this paper, the �rst two moments will su�ce. For the posterior

expectation, the intuition behind the �rst term inside the curly brackets is obvious, but the

intuition for the second is more subtle. Looking to the posterior variance, we see that an

informative signal tends to decrease it from the prior variance. If the prior is asymmetric,

this decrease in variance will disproportionately remove weight from the side the prior skews

towards, which will in turn push the posterior expectation away from that skew. For the

posterior variance, the �rst term in curly brackets is due to the fact that an informative signal

�reassures� subjects. The excess kurtosis term shows that subjects are more reassured when

their prior has more tail risk.11 The skewness term serves as a correction for asymmetry in

tail risk. To see this, consider the case where ν > µ. If γ1 > 0, then the signal pushes the

prior towards the side on which it has the fatter tail. As such, the posterior variance doesn't

decrease as much. If γ1 < 0, then the signal pushes the prior towards the thinner tail, and

the posterior variance decreases even more.

Although it is interesting to note how asymmetry and tail risk a�ect even the �rst two

9Note that di�erent subjects will have di�erent re-scaling if they have di�erent beliefs about the magnitude

of n22(ν | ν)n(ν | ν) . We will ignore this potential heterogeneity going forward, but it could be easily incorporated

into the theory we present.

10The term �excess kurtosis� refers to the fact that a normal distribution hasE[(x−µ)4]

σ4 = 3.11Note that they are, in fact, reassured, since Rohatgi and Székely (1989) proves that the inequality

1 + 12 · γ2 ≥ 0 holds generally.

8

moments of the posterior, for many contexts, we might think that those concerns are unim-

portant, either due to the underlying distributions, or due to bounded rationality. The

intuition behind considering the case where γ1 and γ2 are O (ε2) small is that, for a normal

distribution, these quantities are both zero. For a more general symmetric distribution, γ1 is

zero. So, we will call the prior near-normal if its third and higher central moments are close

to what they would be if the distribution were normal with the same mean and variance,

and we will call the prior near-symmetric if its odd central moments are close to zero.12

2.3 How the individual responds to an information nudge

The previous subsection outlined a method for computing the moments of the posterior given

the moments of the prior. Of course, we are concerned not just with these moments, but

how they a�ect the behavior of our subjects. Fortunately, we can �nd this e�ect by taking

the Taylor series of u (x) centered at the average µ across the population, E [µ]:

u (x) = u′ (E [µ]) ·

{−θ + x− α1 ·

[1

2· (x− E [µ])2 +

∞∑n=3

(−1)n

n!·n−1∏j=2

αj · (x− E [µ])n]}

,

where θ ≡ E [µ] − u(E[µ])u′(E[µ])

, and the αj ≡ −u(j+1)(E[µ])

u(j)(E[µ])are the various ordered coe�cients of

absolute risk aversion, e.g. α1 is the coe�cient of absolute risk aversion, α2 is the coe�cient

of absolute prudence, α3 the coe�cient of absolute temperance, etc. (Kimball 1990, 1992;

Denuit and Eeckhoudt 2010). Again, although risk aversion and its higher-order counterparts

might be interesting in some situations, we think the most interesting applications of our

theory will be when risk aversion can be ignored. The near risk-neutral approximation

is that the αj are O (ε2) small, so that to leading order, the utility is quadratic.

Going forward, we will refer to θ as the threshold, since the part of utility that is

a function of beliefs must be above θ for an agent to take-up.13 This also means that θ

represents the best outside option that an agent has. In the near risk-neutral approximation,

to leading order, an untreated agent will take-up when µ− 12·α1 ·E

[(x− E [µ])2] ≥ θ, while

a treated agent will take-up when µ− 12·α1 ·E

[(x− E [µ])2

∣∣ ν]+σ2 · (ν − µ) · ε2 ≥ θ. Hence,

to leading order, an agent with prior expectation µ is persuaded to take-up whenever µ is

less than σ2 · (ν − µ) · ε2 above µ− 12· α1 ·E

[(x− E [µ])2].14 This will prove useful when we

12Recall that the nth central moment is E [(x− µ)n] where µ = E [µ]. Note that the �rst central moment

is zero, by de�nition.13Note that this assumes that u′ (E [µ]), which is basically equivalent to assuming that higher x is �good

news� about taking-up.14The width of the range does not include 1

2 · α1 ·(E[(x− E [µ])

2∣∣∣ ν]− E

[(x− E [µ])

2])

because that

term is O(α1 · ε2

). This is because the fact that f (x | ν) = f (x)+O

(ε2)implies that the di�erence between

9

aggregate the number of agents persuaded by the nudge across the entire population.

2.4 Modeling a population

So far, we have discussed how information nudges a�ect the decision-making of the individual.

Now, we want aggregate those changes in behavior across an entire population of agents.

Essentially, this requires assuming some joint distribution g over prior moments, i.e. µ, σ2,

etc., and utility parameters, i.e. θ, α1, etc. Depending on what approximations are made

along the way, some of these parameters will not enter into the model, and as such, can

simply be marginalized out. For our purposes, we will only need a distribution over θ, µ,

and σ2.

Before moving on, however, we think a few restrictions on g help to focus the discussion.

First, if all agents draw their priors from the same distribution (i.e. the same random process

of information acquisition), then we might expect that agents' beliefs are independent of their

outside options, that is, θ is independent of (µ, σ2). We call this the common information

acquisition assumption.

We will also assume that µ is independent of σ2. To motivate this, note that the

population-averaged posterior expectation is given by

E[µ+ σ2 · (ν − µ) · ε2

]= E [µ] + ε2 ·

{E[σ2]· (ν − E [µ])− Cov

(σ2, µ

)}.

We might think that, on average, agents should update towards the signal, that is, if ν is

below the population-averaged prior, then the posterior should be below the prior, and if ν

is above the prior, then the posterior should be above the prior. Looking to the formula, if µ

and σ2 are su�ciently positively correlated, then even when the signal is above the popula-

tion averaged prior expectation, we might see the population average posterior expectation

decrease from the prior. Intuitively, if those with prior expectations above the signal are suf-

�ciently more swayed than those with prior expectations below the signal, then the net e�ect

of the signal will be to decrease the population averaged posterior expectation, regardless of

the signal. Keeping µ independent of σ2 prevents this by ensuring that Cov (σ2, µ) = 0. We

call this assumption the update-towards-the-signal assumption.

Together, our two assumptions dictate that the population distribution of agent parame-

ters can be written as g (θ, µ, σ2) = t (θ) ·m (µ) ·s (σ2), that is, as the product of independent

threshold, expectation, and con�dence distributions. Although this may seem extreme, the

results we will derive are not knife's-edge-reliant on these independence assumptions; in fact,

any posterior and prior expectation must be O(ε2).

10

both could be replaced by the assumption that G (θ, µ, σ2)− T (θ) ·M (µ) · S (σ2) is O (ε2)

small, which we call the near independence assumption.

2.5 How the population responds to an information nudge

Among admits whose prior mean and variance are µ and σ2, the fraction persuaded by the

treatment should be the product of the threshold density evaluated at µ, i.e. t (µ), and the

width of the range of thresholds that are persuaded, which is to leading order, σ2·(ν − µ)·ε2.15

Now, to get the fraction persuaded across the entire population, i.e. the treatment e�ect,

we simply take the expectation of our product over the population distribution of (µ, σ2).

Theorem 2. To leading order, under the assumptions made thus far (and summarized in

the next section), the treatment e�ect is equal to either of the following expressions:

τ = ε2 · E[σ2]· E [t (µ)] · (ν − E [µ | θ = µ]) ,

= ε2 · E[σ2]· E [m (θ)] · (ν − E [θ | θ = µ]) .

The second expression comes from the fact that θ ∈ (µ, µ+ σ2 · (ν − µ) · ε2) is (to �rst order

in ε2), equivalent to µ ∈ (θ + σ2 · (ν − θ) · ε2, θ). Conditioning on θ = µ will henceforth

be called conditioning on the marginal agent, as agents with θ = µ are exactly the ones

convinced by nudges. Hence, we have shown that the treatment e�ect's sign is determined

by whether the signal is above or below the average prior of the marginal agent, and its

magnitude is a function of �ve things: the nudge realization (ν), the information content of

the nudge (ε2), how many marginal agents there are (E [t (µ)]), how easily persuaded agents

are on average (E [σ2]), and how far the signal is from the average marginal agent's belief

(ν − E [µ | θ = µ]).

2.6 Discussion of assumptions

In explaining our approach to modeling information nudges, we have introduced assumptions

as they are needed. For convenience, we list them here. The following assumptions are

indispensable to our approach.

Conditional independence: Conditional on the true value of the information being sig-

naled, the signal provides no additional information about other arguments of agents'

utility functions. Without this assumption, we cannot think only in terms of the

quantity that is signaled.

15For a reminder of why this is so, see Section 2.3, and speci�cally Footnote 14.

11

Maximum likelihood: Given one observation of the signal and a uniform prior, that ob-

servation is a maximum likelihood estimator of the underlying x. This assumption is

relatively innocuous: it mainly serves to rule out signals that need to be de-biased, like

ν = x+ 3 with probability 1.

Information nudge: The signal distribution changes very slowly in the true value of the

information being signaled. Speci�cally, we assume that the signal distribution condi-

tional on the true value is the leading order asymptotic in ε of n (ν | ν0 + ε · (x− ν0)),

evaluated at ν0 = ν. Without this assumption, we cannot conclude much of anything

about how the posterior relates the prior.

The rest of our assumptions are made to help the model point at what we think of as the

most intuitive application of the framework. These assumptions can be replaced with others,

and in fact, we will explore a few such alternative models in Section 6 of this paper.

Near symmetric: The skewness of the prior is O (ε2) small. This allows us to ignore

the possibility that the signal �reassures� agents about upside and downside tail risk

asymmetrically.

Near risk-neutral: The coe�cient of absolute risk aversion, as well as the higher-order

coe�cients of absolute risk aversion are O (ε2) small. This allows us to abstract away

from the fact that a signal serves to �reassure� in addition to changing expectations.

Near independent: The di�erence between the joint distribution of (θ, µ, σ2) and the prod-

uct of that distribution's marginals is O (ε2) small. This assumption is a weaker form

of two more intuitively motivated assumptions. The �rst is a common information

acquisition process: across the population, the threshold θ is independent of the belief

parameters (µ, σ2). This allows us to ignore the issues that would arise if agents with

higher beliefs also had better outside options. The second is the update towards the

signal assumption: across the population, σ2 is independent from µ. This allows us to

ignore the issues that would arise if the malleability of an agent's beliefs were strongly

correlated with how optimistic they are about the value of x.

2.7 Indirect information nudges <not polished, but the ideas are

correct>

Our theory thus far has interpreted the nudge signal ν as being on the same scale as the

quantity it provides information about, x. Intuitively, this means that if x is something like

the post-college salary an agent will get if she majors in economics, then ν can be naturally

12

interpreted as a salary, and given a uniform prior, is her best guess at x. Although this is

the most straightforward way to think about information nudges, it is possible to interpret

them more broadly.

For instance, consider our agent who is considering an economics major. She cares about

how much money she will make after college, but doesn't know what that will be. Say she

attends an information session about the major at which she meets 80% of the graduates

from the past 5 years. It is rude to ask the alumni she meets at the session about their

salaries, but she can observe the fraction that are wearing expensive suits. In the most

straightforward application of our model, x is the fraction of alums who can a�ord expensive

suits, and ν is the fraction she observes at the information session. Clearly, observing the

suits of 80% of the universe of economics alums is quite informative about the overall fraction

of economics alums with expensive suits. As such, a straightforward application of the model

is inappropriate. Still, we might think that strong information about the suits that economics

alums wear provides relatively little information about the average salary of an alum. Of

course, how to map the fraction with nice suits into a guess at the average salary is not

obvious. A bit of math will clarify.

Denote the average income of alums by z, the fraction that wear nice suits by x, and our

agent's joint prior by f (x, z). First, we show that a highly informative signal about x is a

weakly informative signal about z if x and z are not strongly dependent. By the same logic

used in Section 2.2, we model this weak dependence by expanding f (x | z) as

f (x | z) ≈ f (x | z0) + f2 (x | z0) · (z − z0) · ε+1

2· f22 (x | z0) · (z − z0)2 · ε2,

where z0 has yet to be determined. From there, we can derive the likelihood of signal ν

conditional on z, n (ν | z), from the likelihood of signal ν conditional on x, n (ν |x):

n (ν | z) =

ˆn (ν | x) · f (x | z) · dx,

≈ˆ

n (ν | x) ·[f (x | z0) + f2 (x | z0) · (z − z0) · ε+

1

2· f22 (x | z0) · (z − y0)2 · ε2

]· dx,

≈ n (ν | z0) + n2 (ν | z0) · (z − z0) · ε+1

2· n22 (ν | z0) · (z − z0)2 · ε2.

Now, de�ne the maximum likelihood estimator of z given signal ν as

z (ν) ≡ arg maxz

n (ν | z)

so that n2 (ν | z (ν)) = 0 and n22 (ν | z (ν)) < 0. Then, if we choose z0 = z (ν), our approxi-

13

mation of the signal distribution conditional on z can be expressed as

n (ν | z) ≈ n (ν | z (ν)) +1

2· n22 (ν | z (ν)) · (z − z (ν))2 · ε2,

= n (ν | z (ν)) ·{

1− 1

2· (z − z (ν))2 · ε2

},

where ε ≡√−n22(ν | z(ν))

n(ν | z(ν)). This is essentially the same expansion we assumed in Section 2.2,

except now, the signal must be �interpreted� through the maximum likelihood estimator,

z (ν). Hence, the moment-update formulas of Theorem 1 remain the same, except that all

prior moments are relative to the prior belief about z, and ν is replaced with z (ν).

Continuing to parallel our earlier derivation, we consider how the logic of Section 2.3

proceeds under our alternative assumptions. Here, we will need to model our agent caring

about whether alums wear nice suits only because this fraction provides information about

alums' salaries.16 This means variation in x does not a�ect the full utility to leading order,

that is u (x, z, y) = u (x′, z, y) +O (ε) for all x, x′. If this is the case, then we can essentially

think of our agent as responding to a weak signal about z and completely ignore x.

Finally, we look at how the logic of Sections 2.4 and 2.5 proceed under our alternative

assumptions. The main wrinkle here is that z (ν) could potentially vary from agent to

agent. One way to deal with this is to assume that all agents will agree on how the direct

signal about x, ν, can be interpreted to a direct signal about z, z (ν). If this is the case,

then all results in this paper carry through replacing ν with z (ν). Alternatively, one could

assume heterogeneity in the interpretation, but in a way that is independent from the other

parameters of the model, in the same way that we assume that µ is independent of σ2. The

net result of this approach would be to replace ν in the theory with the population-averaged

interpretation of ν, E [z (ν)].

So, in Sections 2.2 through 2.5 we discussed the situation when a signal is weakly in-

formative and naturally interpreted. In contrast, this section has discussed the situation

when a signal provides strong information about something that agents do not care directly

about, but that weakly correlates with something they do care about. Both of these situ-

ations can be analyzed with our model of information nudges. The situation in which our

model does not apply is when the signal provides strong information about something agents

care directly about. To make the distinctions more concrete, consider a two-player public

goods game in which players are motivated by some mixture of self-interest and reciprocity.

If Player 1 were told the average contribution of all Player 2s across a large experiment

16We are ruling out �directly� caring about whether alums wear nice suits, that is, if our agent knewall other pertinent information about alums, varying the fraction that wear nice suits would not a�ect herutility.

14

(not just the one to which he is matched), the theory described in Sections 2.2 through 2.5

would apply directly. If Player 1 were told Player 2's gender, then the theory described in

this section would apply, assuming that gender correlates weakly with public goods giving,

and Player 1's reciprocity has no gender-speci�c component, such as �chivalry�. Finally, if

Player 1 were told Player 2's exact contribution, as in Potters, Sefton and Vesterlund (2005,

2007), our theory would not apply at all.

3 Comparative statics

Theorem 2 can help us to understand what makes the treatment e�ect bigger or smaller.

Looking to the primitives of our model, there are �ve main things that can change: the

information content of the nudge, the signal realization, the average con�dence that agents

have in their beliefs, the distribution of the expectation of those beliefs, and the distribution

of outside options. In this section, we will explore how each of these a�ect the treatment

e�ect.

3.1 Signal, con�dence, and strength of nudge

Looking to Theorem 2, it is simple to derive the following elasticities:

∂τ

∂ε2· ε

2

τ= 1,

∂τ

∂E [σ2]· E [σ2]

τ= 1.

All are relatively intuitive. Increasing the parameters ε2 or E [σ2] by some factor simply in-

creases the magnitude of the treatment e�ect, be it positive or negative, by that same factor.

This makes sense, as neither of these parameters are changing the direction in which agents

update, but rather how much they update. Further it makes sense that these parameters

increase the magnitude of the treatment e�ect: ε2 represents the information content of the

nudge, and E [σ2] represents how unsure agents are about their prior expectation. The other

immediate comparative static is

∂τ

∂ν= ε2 · E [t (µ)] · E

[σ2].

It is intuitive that ∂τ∂ν

is always positive, as higher signals should always make taking-up more

attractive. Further, the e�ect of an increase in ν depends on how many agents are marginal

and how con�dent agents are in their prior expectations. Less obviously, the comparative

15

static implies that there are no diminishing returns to increasing ν.

3.2 Thresholds and beliefs

We can model shifts in thresholds and prior expectations by assuming that there exist param-

eters λt and λm that shift the likelihood ratios of those distributions. This creates families

ordered in the monotone likelihood ratio (MLR) sense, which we denote by t (θ |λt) and

m (µ |λm).17 We also want our likelihood ratio shifters to move their distributions arbi-

trarily high or low, so we set their domain to (−∞,∞) and assume, for any µ or θ, that

limλt→−∞

T (θ |λt) = limλm→−∞

M (µ |λm) = 1 and limλt→∞

T (θ |λt) = limλm→∞

M (µ |λm) = 0. These

properties will allow us to more easily derive the comparative statics of increasing thresholds

and beliefs.18

3.2.1 Increasing thresholds

Everything we have discussed thus far is invariant to an increasing transformation of λt.

As such, for this subsection, we will de�ne our likelihood ratio shifter such that λt =

E [µ | θ = µ; λt]. Looking to Theorem 2, this means that τ (λt) ∝ E [t (µ |λt)] · (ν − λt),where the constant of proportionality consists of terms that don't vary in λt. Given this

setup, we already can already say a lot about the general shape of the function. Speci�cally,

Proposition 2. The following are properties of τ (λt):

1. limλt→−∞

τ (λt) = limλt→∞

τ (λt) = 0.

2. When λt < ν, τ (λt) > 0, and when λt > ν, τ (λt) < 0.

3. If both m (µ) and t (θ |λt) are log-concave, then the magnitude of the treatment e�ect

has exactly two peaks: one in the negative treatment e�ect range and one in the positive

treatment e�ect range.

See the appendix for a full proof of the theorem. Intuitively, Part 1 holds because as λt

approaches ±∞, the main mass of thresholds is so far from the main mass of beliefs, that

there is a vanishing fraction of the population that is marginal. Part 2 is obvious, since

sgn {τ (λt)} = sgn {ν − λt}. Part 3 makes use of a common restriction on distributions: that

17A family m (µ |λm) is ordered in the MLR sense if, for any λ′m > λm and µ′ > µ,m(µ′ |λ′m)m(µ |λ′m) >

m(µ′ |λm)m(µ |λm) .

This property is commonly used in conjunction with modeling Bayesian updating. Important for our purposesis that it implies �rst-order stochastic dominance, that is, M

(µ∣∣λ′µ) < M (µ |λµ).

18Technically, these assumptions allow us to use the shifters to move the mean prior expectation of marginalagents across the entire range (−∞,∞). See Lemma 6 in the appendix for a formal proof.

16

its natural logarithm is concave. Many commonly used distributions have this property,

including the uniform, normal, logistic, extreme value, and Laplace distributions (Bagnoli

and Bergstrom 2005). Essentially, log-concavity acts as a strong form of single-peakedness

for positive functions.19 Another good reason to think that log-concavity �ts well into our

framework is the fact that translating a distribution to the right increases it in the monotone

likelihood ratio sense if and only if it is log-concave.20 Once we have log-concavity, the single-

peakedness of the treatment e�ect on its positive and negative ranges is a straightforward

application of the fact that integrating an argument out of a log-concave function leaves

behind a log-concave function (Prékopa 1973).

3.2.2 Increasing beliefs

Now, we consider what happens when we shift λm, holding λt constant. As before, we de�ne

our shifter such that λm = E [µ | θ = µ; λm], leaving us with τ (λm) ∝ E [m (θ |λm)]·(ν − λm).

By logic similar to that of the previous section, we can characterize the shape of τ (λm).

Speci�cally,

Proposition 3. The following are properties of τ (λm):

1. limλm→−∞

τ (λm) = limλm→∞

τ (λm) = 0.

2. When λm < ν, τ (λm) > 0, and when λm > ν, τ (λm) < 0.

3. If both m (µ |λm) and t (θ) are log-concave, then the magnitude of the treatment e�ect

has exactly two peaks: one in the negative treatment e�ect range and one in the positive

treatment e�ect range.

So, the treatment e�ect as a function of the shifters looks the same for both: it starts at zero,

increases to a peak, then decreases to zero, then decreases to a trough, and then increases

to zero again.

3.3 Proxying for shifters using the baseline take-up rate

Although the comparative statics of Subsections 3.2.1 and 3.2.2 are interesting, it would

be di�cult to take them to the data, as the shifters correspond to the expected threshold

19To see that log-concavity implies single-peakedness, note that ∂2 log f(x)∂x2 < 0 and f > 0 imply that

f ′′ < f ′2

f . At an optimum, f ′ = 0, which means that f ′′ < 0, the second-order condition for a localmaximum. And if all local optima are maxima, then there can only be one local optimum. To see howlog-concavity is a strengthening of single-peakedness, note that Ibragimov (1956) shows that a function islog-concave if and only if, when convolved with another single-peaked function, the result is single-peakedas well.

20This is a straightforward consequence of the fact that ∂2 log f(x−t)∂x∂t = −∂

2 log f(x−t)∂x2 .

17

and expected belief of the marginal agent, and often, who is marginal is di�cult to surmise.

Fortunately, we will be able to proxy for our shifters with the baseline take-up rate, which

is given by the following two related-through-integration-by-parts expressions:

β (λt, λm) =

ˆT (µ |λt) ·m (µ |λm) · dµ,

=

ˆt(θ∣∣∣λt) · [1−M (

θ∣∣∣λm)] · dθ.

Intuitively, the �rst expression comes from the fact that for any belief µ, the fraction of

agents whose threshold is lower is T (µ |λt). The second expression has similar intuition.

The baseline is able to act as a proxy for the shifters because, for �xed λm, the baseline runs

monotonically from 0 to 1 as λt decreases (and for �xed λt, it runs monotonically from 0 to

1 as λm increases.21 Hence, when only λm or only λt vary, there is a one-to-one mapping

between shifter and baseline, which means we can restate our comparative statics in terms

of the treatment e�ect as a function of the baseline take-up rate, τ (β).

Theorem 3. If changes in β are being driven by changes in the threshold shifter λt, then

the following are properties of τ (β):

• There exists a β0 such that τ (β) ≤ 0 when β ≤ β0, and τ (β) ≥ 0 for β ≥ β0.

• If both m (µ) and t (θ |λt) are log-concave, then the magnitude of τ (β) has exactly two

peaks: one in the negative treatment e�ect range and one in the positive treatment

e�ect range.

If changes in β are being driven by changes in the prior expectation shifter λm, then the

following are properties of τ (β):

• There exists a β0 such that τ (β) ≥ 0 when β ≤ β0, and τ (β) ≤ 0 for β ≥ β0.

• If both m (µ |λm) and t (θ) are log-concave, then the magnitude of τ (β) has exactly

two peaks: one in the negative treatment e�ect range and one in the positive treatment

e�ect range.

Either way, limβ→0

τ (β) = limβ→1

τ (β) = 0.

This theorem provides a natural way to gain some simple intuition into when we expect

the treatment e�ect to be large or small. The general shape of the curve suggested by the

theorem can be seen in Figure 1.

21See Lemma 9 in the appendix for a formal proof.

18

Figure 1: τ (β) when changes in β are driven by changes in λt.

4 Making the comparative statics more precise

Often, in addition to simple comparative statics, we might want to have some idea of the

baseline take-up rate at which τ (β) goes from negative to positive, or the point at which a

positive e�ect goes from increasing to decreasing. To get precise answers to such questions,

we must make more speci�c functional form assumptions, and supplement them with belief

surveys. In this section, we will discuss how to model treatment e�ect with belief surveys of

di�erent coarseness of information.

4.1 The normal model

Assume that µ and θ are independent and normally distributed with means E [µ] and E [θ]

and variances Var [µ] and Var [θ]. Further assume that changes in β are driven by translations

in the threshold distribution (i.e. the distribution remains normal with the same variance

but a di�erent mean). We call this setup the normal model.

Theorem 4. In the normal model, the treatment e�ect as a function of the baseline is given

by

τ (β) = ε2 · E[σ2]· 1√

1 + η2· ϕ(Φ−1 (β)

)·

{z (ν) +

1√1 + η2

· Φ−1 (β)

},

where z (ν) ≡ ν−E[µ]√Var(µ)

is the z-score of the signal on the distribution of µ, and η ≡√

Var[θ]Var[µ]

19

is the ratio of the standard deviations of θ and µ. Further, since distributions in the normal

model are log-concave, Theorem 3 comes to bear, which means that we expect one zero, one

minimum, and one maximum. The zero of τ (β) is at

β0 = Φ(−√

1 + η2 · z (ν)),

while the minimum β− and the maximum β+ of τ (β) are at

β± = Φ

1

2· Φ−1 (β0)±

√(1

2· Φ−1 (β0)

)2

+ 1

.

See the appendix for a derivation. This functional form is actually what was plotted in

Figure 1. In many situations, η is an unknown parameter. Fortunately, our formulae still

tell give us bounds on where the extrema are located, and the ranges on which τ (β) is

monotonic. In the main text, we will present bounds that only require knowing whether the

signal is �good news� or �bad news� relative to the population distribution of beliefs. In the

appendix, we derive similar bounds for a known value of z (ν).

Proposition 4. In the normal model, when the signal is �good news�, that is, z(ν) > 0, we

know that, β0 ∈(0, 1

2

), which means that

• β+ ∈(

12,Φ (1) ≈ 0.841

)and β− ∈ (0,Φ (−1) ≈ 0.159),

• τ (β) is increasing on β ∈(0.159, 1

2

)and decreasing on β ∈ (0.841, 1).

Similarly, when the signal is �bad news�, that is, z(ν) < 0, we know that β0 ∈(

12, 1), which

means that

• β+ ∈ (0.841, 1) and β− ∈(0.159, 1

2

),

• τ (β) is decreasing on β ∈ (0, 0.159) and increasing on β ∈(

12, 0.841

).

The proof hinges on the fact that the extrema β± are both monotonically increasing in the

location of the zero, β0.22 See the appendix for a proof. Note that this theorem give very

stark results about whether τ (β) is increasing or decreasing on certain ranges. This can be

particularly useful as a guide for practitioners who are trying to determine whether to expect

22To see this, note that the derivative of the extrema locations is ∂β±∂β0

=

ϕ

(12 · Φ

−1 (β0)±√(

12 · Φ−1 (β0)

)2+ 1

)· 1

2 ·

(1±

12 ·Φ−1(β0)√

( 12 ·Φ−1(β0))

2+1

)· ∂Φ−1

∂β0(β0). None of the terms

in that product can be negative.

20

larger or smaller treatment e�ects in subgroups that have di�erent values for the baseline,

β.

For similar reasons, we might also care about what baselines should yield a larger treat-

ment e�ect than a given, reference baseline. De�ne the upper contour set of the treatment

e�ect as Υ (β | η) ={β : τ

(β∣∣∣ η) ≥ τ (β | η)

}and the lower contour set as Λ (β | η) ={

β : τ(β∣∣∣ η) ≤ τ (β | η)

}. These set-valued functions take a baseline β and return the set

of other baseline take-up rates at which we expect a treatment e�ect as big or small as it is

at β. In the normal model, these functions can be written as simple intervals since, as we

saw in Theorem 3, both τ (β) and −τ (β) are both single-peaked on the ranges that they are

positive. Further, we can derive inner bounds for these sets when η is unknown.

Theorem 5. De�ne b (η) implicitly as the solution to τ (b (η) | η) = τ (β | η) that isn't β.

Then:

• if the signal is good news (i.e. z (ν) > 0), and the treatment e�ect is positive and

decreasing at baseline take-up rate β, then Υ (β | η) = [b (η) , β], where b (η) is decreasing

in ηfor η ≥ 0 and hence Υ (β | 0) is an inner bound, that is Υ (β | 0) ⊆ Υ (β | η).

• if the signal is bad news (i.e. z (ν) < 0), and the treatment e�ect is negative and

decreasing at baseline take-up rate β, then Λ (β | η) = [β, b (η)], where b (η) is increasing

in ηor η ≥ 0 and hence Λ (β | 0) is an inner bound, that is Λ (β | 0) ⊆ Λ (β | η).

When η is unknown, Theorem 5 provides guidance about when a treatment e�ect will be

larger in magnitude. A conservative estimate can be obtained by setting η to be at the low

end of what is expected (see Co�man, Featherstone and Kessler (2014) for an application of

this approach with η = 1). Absent any information, η = 0 is the most conservative bound.

4.2 Belief and threshold surveys

If we have a belief survey that gives us m (µ), then we can simply guess some family for

t (θ |λt). From there, we can numerically integrate both the baseline take-up rate and

the treatment e�ect as a function of λt. If t (θ |λt) is ordered in the monotone likeli-

hood ratio sense, then λt simply serves to to parametrize the relationship τ (β), since

β (λt) =´ [

1−M(θ)]· t(θ∣∣∣λt) · dθ is a monotonically decreasing function. Moreover,

if we have a survey that provides the threshold distribution, then we can numerically inte-

grate treatment and baseline take-up rate in the same way, except now we will only need to

make an assumption about how λt moves the empirical distribution of thresholds. Simple

translation seems like the most likely option.

21

5 Interpreting the comparative statics

The comparative statics described in sections 3 and 4 provide intuition about how we expect

treatment e�ects to vary across groups with di�ering baseline take-up rates. In this section,

we use this intuiton to test our model with existing empirical results from the literature

on information nudges. (Note: Our current meta-analysis is limited to 62 data points from

13 papers, which we found in an initial search of the literature, we are in the process of

getting more data and will rerun the analysis upon collecting it all.) While the papers span

a variety of domains and interventions, the results are largely consistent with our theory.

Consequently, we are able to reconcile results across the literature on information nudges

and highlight why some attempts at a�ecting behavior with informaton fail to achieve the

desired goal.

5.1 Within and across experiments

Many empirical papers report treatment e�ects of their intervention across multiple sub-

groups. Our theory makes sharp predictions about which subgroups will display larger

treatment e�ects. Across subgroups in the same experiment, the realization and information

content of the signal remains constant, so the relevant comparative statics are con�dence,

summarized by E [σ2], threshold (e.g. quality of outside option), summarized by t (θ |λt),and optimism, summarized by β (µ |λm). For the �rst, we would expect treatment e�ect to

be independent of baseline take-up, since the baseline does not depend on E [σ2]. For the

second, we would expect a relation like that in Figure 1. For the third, we would expect

a similar relationship, but that went positive and then negative as baseline increased. In

practice, most subgrouping in empirical papers is done based on demographic characteristics.

Unless we have an ex ante reason to think that a demographic group would be particularly

optimistic or pessimistic about the value of take-up (e.g. if men and women were given

feedback about the likelihood of winning a tournament in which men's beliefs are overly op-

timistic relative to those of women), it seems safe to assume that when di�erent demographic

groups have di�erent take-up rates it is caused by di�erences in preferences for the action

relative to the outside option (e.g. because of di�erent tastes for the action or di�erent out-

side options). (With more data, one could imagine only using within-experiment variation

to test the model. At the moment, we must also rely on across-experiment variation, which

requires additional assumptions described below.)

Across experiments, baseline take-up rates di�er along with setting and associated nudge.

Consequently, when we think about moving from one setting (e.g. with a lower baseline) to

another setting (e.g. with a higher baseline) it is less clear whether we should think about

22

individuals in the latter group havingworse outside options (i.e. lower thresholds) or being

more optimistic about the value of the action (i.e. higher beliefs). For our theory to make

predictions about how baseline correlates with treatment e�ects across experiments, however,

we have to take a stand on whether thresholds or beliefs are shifting across experiments.

Fortunately, the type of experimental settings researchers select can act as a guide. Because

the standard intution is that information interventions work when nudges are higher than the

average beliefs, we expect most experiments that are actually undertaken to provide signals

that are relatively far to the right of the belief distribution. So while there is variation in

how far to the right the signal is relative to average beliefs, we can assume that, across

experiments, if the belief distribution moves to the right, so does the signal. Further, if

we think about the model thus far, beliefs, thresholds, and signals are only de�ned relative

to each other. That is, if we translate all of these quantities to the right, nothing changes.

Hence, broadly speaking, shifting the belief distribution and the signal in lock-step is roughly

equivalent to shifting the threshold distribution in the opposite direction. As such, even

across papers, we have reason to think that the threshold shifter comparative static is the

best guide, which allows us to use the baseline variation across experiments in the same

way we use it within experiments. In the next section, we will put this logic to the test by

plotting treatment e�ects against baseline take-up rates for all subgroups from a sample of

papers from the empirical nudge literature.

5.2 Empirical support

As a �rst check of the theory, we investigate if the functional relationship between baseline

rates and treatment e�ects is consistent with predictions from the theory. To simplify the

analysis, we assume that the papers in our data set nudge with information that is high

relative to the distribution of beliefs held by subjects. This is not a strong assumption, as

the papers in our analysis all intended to increase beliefs about the utility of the desired

outcome. Looking back to our theory, we predict that there will be weakly negative treatment

e�ects at very low baselines, and then an inverted U-shape of positive treatment e�ects at

other baselines.23 As we will discuss later, our current data is under-powered to test the

�rst part of this prediction. Consequently, in what follows we explore the second part of

this prediction by looking for an inverted U-shape with treatment e�ects increasing and then

decreasing with the baseline.

23Essentially, we expect a relation akin to that in Figure 1. To see this more formally, consider the normalmodel from Section 4.1. Theorem 4, shows that the more the nudge is large relative to the belief distributionof subjects (codi�ed by a large z (ν)), the closer the internal zero of the normal model, β0, is to zero. A bitmore algebra shows that as β0 → 0, the ratio of the depth of the trough of τ (β) to the height of the peakof τ (β) is zero.

23

This approach is not intended to be an iron-clad test of the theory, but rather a demon-

stration that the theory is consistent with the literature and serves to organize it. Further,

the data to which we have access are not ideal. First, without precise measurement and

reporting of every parameter in the formulas from Theorem 2, we cannot make precise pre-

dictions, e.g. the threshold for negative treatment e�ects, or the peak of the curve. Second,

and more importantly, the data set may be confounded by publication bias. Papers with an

insigni�cant or negative treatment e�ect may not ever be published, and as a result will not

make it into our analysis. This is problematic since our theory predicts bigger treatment

e�ects at intermediate baselines where variance of a binary variable is at its maximum � if

papers are only published with signi�cant results we might fail to see small treatment e�ects

at intermediate baselines. This concern is mitigated, however, since we have data reported

for secondary analyses (e.g. subgrouping or secondary outcomes) from published papers,

which are likely less prone to publication bias.

As dictated by our model, we analyze papers that satisfy three criteria: information

was experimentally provided, there was a binary outcome variable, and base rates in the

(no-information) control condition were reported. In all, we have 62 data points from 13

di�erent papers. We do not claim the papers collected are exhaustive. However, we have no

reason to believe these papers are not representative of the relationship between treatment

e�ects and baseline rates. For visual simplicity, we exclude three treatment e�ects with an

absolute value greater than 15 percentage points, leaving 59 data points to analyze.24

The average treatment e�ect in the sample is +3.0 percentage points (pp) with a median

of +2.7pp. The treatment e�eects vary widely, with a standard deviation of 3.1pp. However,

negative treatment e�ects comprise only eight of the 59 treatment e�ects. Seven of those

negative treatment e�ects are fairly insubstantial, with an absolute value less than two

percent, and six of the eight are within a standard error of zero. In short, the current

analysis will be underpowered to test which base rates, if any, produce negative treatment

e�ects.

Baseline take-up rates are skewed towards the low side of the unit interval, with a median

of 30% and an average of 37%. This may be by chance. It may also be that researchers

choose to experiment on low take-up rates, perhaps because this indicates a problem exists,

and there is much potential for improvement. Though this intuition is appealing, the lower

end of the unit interval is precisely where our theory is most pessimistic for substantial,

positive treatment e�ects, as shown in Figure 1.

24This restriction is mostly to avoid the noise introduced by large outliers, although some sort of restrictionof this form is supported by the theory, which is only valid for �nudges� � implying modest changes in beliefsand behaviors. Results are qualitatively consistent when these three data points are included.

24

-.05

0.0

5.1

.15

Siz

e of

Tre

atm

ent E

ffect

0 .2 .4 .6 .8 1Take-up Rate in Control Group

Allcott & Taubinsky (2015) Avitabile & de Hoyos (2015)

Bettinger et al (2011) Cai et al (2009)

Clark et al (2013) Coffman et al (2015)

Del Carpio (2014) Hastings et al (2015)

Hastings & Weinstein (2008) Jensen (2010)

Kuziemko et al (2015) Karadja et al (2014)

Nguyen (2008)

Each dot represents the treatment e�ect for a group with a particular baseline take-up rate. Papers identi�ed bycolor. Marker size is weighted by inverse standard error. To facilitate viewing, three treatment e�ects with absolutevalue greater than 0.15 are excluded from the analysis.

Figure 2: Treatment e�ect sizes (pp) by baseline take-up rates (%)

To give a raw look of the data, Figure 2 is a scatter plot of treatment e�ect sizes on the

vertical axis and baseline take-up rates along the horizontal axis. The size of the markers

is weighted by the inverse standard error of the treatment e�ect; larger dots represent more

precise estimates. Visually, the data roughly form the inverted U-shape predicted by the

theory. Going left to right, treatment e�ects start out quite small, then generally increase,

then decrease.

Regression analysis supports the visual. We run ordinary least squares regressions to �nd

the best-�t quadratic,

τ = a+ b1 · β + b2 · β2 + error,

weighting observations by the inverse of the squared standard error of the treatment e�ect

(Model I) or clustering at the paper level (Model II). Table 1 displays the results. The coef-

�cient b1 is positive and signi�cant, b2 is negative and signi�cant, and a is indistinguishable

from zero in both models. The estimated relationship is an inverted-U-shape, as base rates

increase, treatment e�ect sizes increase and then decrease. This can be seen visually in

25

D.V. = Size of Treatment E�ectI II III IV

Data restriction None β ∈ [0.159, 0.5]β (baseline take-up rate) 0.23 0.23 0.13 0.13

(0.03)*** (0.04)*** (0.04)*** (0.04)***β2 -0.26 -0.26

(0.04)*** (0.04)**Constant -0.00 -0.00 0.00 0.00

(0.01) (0.01) (0.01) (0.04)Inverse variance weighting? Yes No Yes NoClustered by paper? No Yes No YesNum. Results 59 59 34 34Num. Papers 13 13 9 9

*, **, *** denotes signi�cance at 0.1, 0.05, and 0.01, respectively.

Both baseline and treatment e�ect are measured in percentage points. For instance, the 0.23 coe�cient for β underModel I should be interpreted as a 0.23 percentage point change in treatment e�ect for a 1 percentage point changein baseline.

Table 1: The relationship between baseline and treatment e�ect.

Figure 3, which shows a smoothed local polynomial regression of the data.

Further, the models are in line with the more precise predictions made in Proposition 4,

which states that if the information provided is �good news�, i.e. it is greater than the mean

belief of the population, then in the normal model, the treatment e�ect is increasing with

baseline take-up rates on the range β ∈ (0.159, 0.5) and decreasing with baseline take-up

rates on the range β ∈ (0.841, 1). We only have one data point in (0.841, 1), so we focus on

the �rst interval. Indeed, as can be seen in Figure 3, treatment e�ects are increasing over

(0.159, 0.5). Further, Models III and IV of Table 1 estimate a linear �t of data from that

range, and estimates a signi�cant and substantial increase in treatment e�ects with respect

to baseline take-up rates. Results suggest a 1.2pp increase in treatment e�ect for every ten

point increase in baseline rate over that range (p < 0.01 in both regressions).

Overall, the data provide strong support for the theory. Due to current data limitations,

we are not able to test what, if any, baseline take-up rates predict negative treatment e�ects.

However, the data show strong evidence for the inverted-U relationship between baseline

take-up rates and treatment e�ects. Empirically, the intuition that low baseline take-up

rates may be an ideal environment for experimenting might be outweighed by the factors

identi�ed in our model.

26

0.0

1.0

2.0

3.0

4.0

5S

ize

of T

reat

men

t Effe

ct

0 .2 .4 .6 .8 1Take-up Rate in Control Group

The con�dence bands are drawn at the 95% con�dence level.

Figure 3: Estimated Relationship between Baseline Take-up and Treatment E�ect

6 Extensions to the model

Although the paper has thus far focused on the e�ect of revealing information to near-risk-

neutral agents making a single binary choice, our framework can be applied to a variety of

related settings. In this section, we elucidate �ve such extensions, focusing on additions to

the model that we think are particularly relevant for both practitioners and for academics

interested in understanding how nudges can a�ect behavior and, as in subsection 6.1, about

how such nudges e�ects welfare � a topic of growing interest (see, e.g., Allcott and Kessler

(2015)).

6.1 Welfare

Our basic model allows us to compute the treatment e�ect of the information nudge on

welfare, τW . To compute τW , note that only agents whose behavior is changed by the

nudge contribute, that is, agents with thresholds in the range θ ∈ [µ, µ+ σ2 · (ν − µ) · ε2].

Of those agents with some prior expectation µ, the fraction that are persuaded is hence

t (µ) · σ2 · (ν − µ) · ε2. Each of those agents ultimately prefers take-up to abstention by

27

κ · (x− θ), where κ ≡ u′ (E [µ]).25 So, for a particular value of µ, the expected welfare

change is t (µ) ·σ2 · (ν − µ) ·ε2 ·κ · (x− θ). Note that while behavior is determined by ex anteexpectations about x, welfare is determined by the value of x that is ultimately realized.

The overall e�ect on welfare can be derived by taking the expectation of our expression over

µ.

Theorem 6. For a given realization of x, the ex post treatment e�ect of the information

nudge ν on welfare is

τW (ν) = ε2 · E [κ] · E[σ2]· E [t (µ)] · (x− E [µ | θ = µ]) · (ν − E [µ | θ = µ]) ,

which is negative when the signal ν and the actual information x are on di�erent sides of the

expected belief of the marginal agent, E [µ | θ = µ]. However, the ex ante (i.e. x is known,

but not ν) treatment e�ect of the nudge on welfare is positive and equal to

E[τW (ν)

]= ε2 · E [κ] · E

[σ2]· E [t (µ)] · (x− E [µ | θ = µ])2 .

The two parts of the theorem illustrate the subtlety of the intuition that providing infor-

mation to agents can only help decision making and thus cannot make agents worse o�.

In expectation, a signal of x helps; however, there exist realizations of the signal that hurt

agents overall.

6.2 Attrition

In many binary choice situations, there is a time between the original choice and the realiza-

tion of x in which takers can renege or attrit.26 One reason for takers to attrit is the arrival

of new alternatives that are preferred to taking-up. In this sort of scenario, we should worry

not just about takers who were marginal about taking-up originally, but also about takers

who were infra-marginal at the time of original decision, but who might become marginal as

a new opportunity arrives.

To model this, assume that after taking-up, takers receive a new outside option, codi�ed

by a new threshold θN , distributed according to the density tN(θN). This new outside option

will cause a taker to renege if it is greater than his belief, µ. So, our agents who attrit have

25Intuitively, an agent's κ is the utility change she experiences due to a unit increase in x.26For example, Co�man, Featherstone and Kessler (2014) investigate the e�ect of an information nudge

that tells new admitted applicants of Teach For America about the program's matriculation rate in theprevious year. In this setting, potential teachers say yes to the program months before they actually startteaching (and often also many months before they start training) and so practitioners are concerned thatpotential teachers might say yes and then subsequently decide they would rather work elsewhere.

28

θ < µ < θN . The number of such agents is β −´T (µ) · TN (µ) ·m (µ) · dµ, and hence the

attrition rate is 1−´T (µ)·TN (µ)·m(µ)·dµ

β. Now, treatment will prevent attrition in agents with

θ < θN and θN ∈ (µ, µ+ σ2 · (ν − µ) · ε2). As before, we can multiply the width of this range,

σ2 · (ν − µ) · ε2, by the density of agents on this range, yielding T (µ) · tN (µ) ·σ2 · (ν − µ) · ε2.

Taking the expectation across the population, we can get the e�ect of the nudge on the

attrition rate.

Theorem 7. The e�ect of the nudge on the attrition rate is

−E[σ2]· E[tN (µ)

∣∣µ ≥ θ]·(ν − E

[µ∣∣ θ ≤ µ = θN

])· ε2.

Note that a positive treatment e�ect on taking-up does not guarantee a positive treat-

ment e�ect on attrition rate. For instance, if the new alternative o�er is just another draw

from the original threshold distribution, then it is simple to show that E[µ∣∣ θ ≤ µ = θN

]≥

E [µ | θ = µ].27 In this case, it is possible to have a nudge that induces a positive e�ect on

take-up, but a negative e�ect on attrition rate. Hence for signals that just barely induce a

positive e�ect on take-up, it is sensible for the policy-maker to at least be cognizant of the

possibility that the intervention might induce a higher attrition rate. The �ip side of this is

that for high enough signals, the nudge induces more take-up, through its action on marginal

takers, and less attrition, through its e�ect on infra-marginal takers.

To make this concern more concrete, we can derive the treatment e�ect on attrition in

the normal model with the new alternative o�er being just another draw from the original

threshold distribution.

Proposition 5. In the normal model where tN (θ) = t (θ), the treatment e�ect on attritionis

τA = −ε2 · E[σ2]·

1√1 + η2

· ϕ(Φ−1 (β)

)· Φ(

η · Φ−1 (β)√(1 + η2) · (2 + η2)

)

·

z (ν) +1√

1 + η2· Φ−1 (β)−

η√(1 + η2) · (2 + η2)

·ϕ

(η√

2+η2· Φ−1 (β)

)Φ

(η√

(1+η2)·(2+η2)· Φ−1 (β)

),

,

where z (ν) and η are de�ned as in Theorem 4.

Consider η = 1 and β = 50%. According to Theorem 4, any z (ν) ≥ 0 will induce a positive

treatment e�ect on baseline take-up. But Proposition 5 tells us that the treatment e�ect on

27To show this, we simply look at the likelihood ratio:m(µ | θ≤µ=θN)m(µ | θ=µ) ∝ m(µ)·tN (µ)·T (µ)

m(µ)·t(µ) , where the con-

stant of proportionality does not depend on µ. When tN (µ) = t (µ), we get that the likelihood ratio isproportional to T (µ), which is clearly increasing. From there, the monotone likelihood ratio implies thatm(µ∣∣ θ ≤ µ = θN

)�rst-order stochastically dominates m (µ | θ = µ).

29

attrition rate is only positive if z (ν) ≥ 0.33. Any z (ν) between 0 and 0.33 will induce a

higher baseline take-up rate, but also a higher attrition rate.

6.3 Nudges that reassure

In the basic model, we allowed the nudge signal to be arbitrarily far from agents' prior

expectations, that is, we didn't assume that ν − µ was small. This led to our theory being

dominated by how the nudge directly shifted the expectation. In this section, we will consider

what other forces come into play when ν − µ is small as well. Ultimately, this alternative

modeling assumption demonstrates the potential value of a �reassuring nudge� that simply

increases an agent's con�dence in their prior beliefs � something like �most people have a

more accurate guess at x than they think.�28

Mathematically, let ν − µ is of the same order as α1, γ1, γ2, and ε2. If this is the case,

then we can essentially follow the same logic as before (cf. the end of Section 2.3 and the

beginning of Section 2.5) to derive an expression for the treatment e�ect under our new

assumptions.

Theorem 8. When ν − µ is of the same order as α1, γ1, γ2, and ε2, to leading order, the

treatment e�ect is given by

τ = E [t (µ)] ·{E[σ2]· (ν − E [µ | θ = µ])− 1

2· E[σ3]· E [γ1] +

1

2· E[σ4]· E [α1]

}· ε2.

Essentially, we now have two �extra� forces driving the treatment e�ect, in addition to the

one discussed in Section 2.5. The force proportional to α1 comes from the fact that the

nudge decreases the second moment of the posterior, which leads to a risk-averse agent

having a higher expected utility from taking-up. This force always leads to a bump in

take-up, so long as agents are risk-averse. The force proportional to γ1 comes from the fact

that decreasing the second moment, in addition to a�ecting utility through risk aversion,

also means that tail events are less important. If the agent is worried about left-tail events

(γ1 < 0), then the nudge reassures her that those events are unlikely, leading to a bump in

take-up. Conversely, if the agent is banking on right-tail events (γ1 > 0), the nudge convinces

her that such occurrences are unlikely, leading to a drop in take-up.

28We do not know of any examples of this type of nudge having been analyzed in the literature, butwe see reassuring motives in campaigns designed to encourage people who think they might have observedsuspicious behavior to �see something, say something� and in perennial appeals by teachers to students that�if you have a question, you're not the only one.�

30

6.4 Continuous choices

While this paper has focused on binary choices, in many relevant settings where information

nudges are employed agents have the opportunity to make continuous choices. We can use

the same framework developed in this paper to analyze how information nudges a�ect these

continuous choices.

Instead of a binary choice, now let agents choose an action (a continuous variable a) to

maximize the expectation of a utility function that also depends on x. Expanding utility in

a Taylor series about (a0,E [µ]), where a0 is the action that would be taken by an agent with

µ = E [µ], we get

u (a, x) ≈ u (a0,E [µ]) + u1 (a0,E [µ]) · (a− a0) + u2 (a0,E [µ]) · (x− E [µ])

+1

2· u11 (a0,E [µ]) · (a− a0)2 + u12 (a0,E [µ]) · (x− E [µ]) · (a− a0)

+1

2· u22 (a0,E [µ]) · (x− E [µ])2 .

If we assume that the prior is near-normal and that further terms in the expansion are O (ε2)small, then, to leading order, the agent maximizing E [u (a, x) | ν] is solving

maxa

{u1 (a0,E [µ]) · (a− a0) +

1

2· u11 (a0,E [µ]) · (a− a0)

2

+ u12 (a0,E [µ]) ·[µ− E [µ] + σ2 · (ν − µ) · ε2

]· (a− a0)

}.

In this scenario, the treatment e�ect is simple to derive. We de�ne the responsiveness by

ρ ≡ −u12(a0,E[µ])u11(a0,E[µ])

, and by the same logic that supports the common information acquisition

assumption, we assume that ρ is independent of the other population parameters.

Theorem 9. The average change in action (i.e. the treatment e�ect) is

τ = E [ρ] · E[σ2]· (ν − E [µ]) · ε2.

Note that for continuous choices, the signal at which the e�ect goes from negative to positive

is the average prior expectation across the entire population. Contrast this with the binary

choice model, where the switching point is the average prior expectation of the marginal

admit. The di�erence is intuitive: for continuous choice, we care about the change in utility

for all agents, because it leads to a change in behavior for all agents. For binary choice,

although the utility of all agents is changed, we only care about the change for the smaller

set of agents for whom it will change their behavior, that is, we only care about the utility

change for marginal agents. To further emphasize the di�erence, note that the computation

31

done in this section would look almost identical to trying to compute the average change in

utility in the binary choice model, that is, in the binary choice model, the average change in

utility switches signs as ν crosses E [µ], while the treatment e�ect on take-up changes sign

as ν crosses E [µ | θ = µ].

6.5 Strategic signal revelation

Given that nudges can successfully in�uence behavior of agents, we now turn brie�y to how

practitioners who have an incentive to get agents to take an action would optimally utilize

information nudges to achieve their goals.

We consider a model where the �rm can choose to coarsen the information it reveals,

but there is no leeway for lying, similar to Kamenica and Gentzkow (2011). A revelation

strategy of this sort can be represented by a partition of the real line where, instead of

revealing the signal, the �rm reveals which element of the partition the signal is in. We call

the strategy full revelation whenever each element of its partition is a singleton. First,

note that if the �rm cannot commit to a revelation strategy, then the only equilibrium is full

disclosure. This is because whenever a signal is realized that is near the upper boundary of

a partition element, the �rm would prefer to fully reveal the signal. But if agents know this,

then when the �rm does not fully reveal, they infer that the signal is not near the top of

the partition element. The only revelation strategy that doesn't unravel in this way is full

revelation. But what if the �rm can commit? This turns out not to help either.

Theorem 10. Even if the �rm can commit to a revelation strategy, it cannot make the

treatment e�ect any higher than with full revelation.

Intuitively, to leading order, agents who know only that ν ∈ Π will act as if they received a

deterministic signal that maximizes the likelihood of the even ν ∈ Π. Further, any disagree-

ment about this signal between agents will be O (ε2), and hence can be ignored to leading

order, since it is already multiplied by ε2 in the formula for τ anyway. So all agents act

as if they received the same signal. What's more, this is also what the principal expects ν

to be, conditional on ν ∈ Π, for the same reasons. So, in expectation, full revelation and

coarser revelation strategies yield the same treatment e�ect, to leading order. Notice how

this depends on the linearity of τ in ν, which is reminiscent of the results Kamenica and

Gentzkow (2011), speci�cally Remark 1.

Two points immediately follow from the previous analysis. First, one could imagine that

some agents are �naive�, that is, if the �rm says �nothing� then they do nott update their

beliefs. In this world, if the signal would induce a positive treatment e�ect, it is clear that

the �rm does best to reveal it. If the signal would induce a negative treatment e�ect, then

32

the �rm does best to say �nothing�, as it keeps the marginal naive agents matriculating, but

in expectation is no di�erent from full revelation in terms of the treatment e�ect on the

sophisticated agents. Consequently, in these settings, �rms would only utilize information

nudges that revealed good news about the desired action, a strategy that appears to be

utilized in practice.

Second, if the principal were not risk-neutral about the treatment e�ect he ultimately

gets, then a coarser revelation strategy provides insurance without costing anything in terms

of expected treatment e�ect. This reasoning could help to explain how �rms deal with a

exogenously provided revelation strategies. For instance, cities that require a restaurant to

publicly display its �letter grade� following health inspections. If there exists �ner information

about the restaurant, a �rst intuition says that a restaurants with a �B� should want to report

that it actually got a �B+�. Of course, in doing so, sophisticated customers realize that in

other months where it doesn't report �B+�, it must have actually gotten a �B� or �B-�. By

establishing a policy of never revealing more than the city forces it to, the restaurant insures

itself against a �B-� by not taking the added payo� of a �B+� when it can.

7 Conclusion

As nudges become more prominent in the academic literature and more common as a policy

tool, there is an enhanced interest in understanding why nudges work, when they will be

successful (see, e.g., Beshears et al. (2015b)), and their welfare implications (see, e.g., Allcott

and Kessler (2015)). Central to this exercise is developing models of these nudges that can

give insight into the underlying mechanisms that can guide us to answers of how nudges

might work and who will be persuaded by them.

In this paper, we introducing a theory of information nudges that allows for Bayesian

updating in a setting of binary choice. Our model highlights that in these settings, the rel-

evant question about the sign and of the treatment e�ect is whether the information nudge

provides �good news� about taking the action to agents at the margin. This �nding highlights

the potential value of eliciting beliefs from agents � particularly of those agent at the mar-

gin. Our model additionally suggests, however, that baseline take-up rate in the untreated

group can be a useful proxy for the beliefs of marginal agents. This allows researchers and

practitioners to infer the likely sign and magnitude of a treatment e�ect arising from an

information nudge even without information on beliefs. In a meta-analysis of academic �nd-

ings, we �nd that the relationship between treatment e�ect size and baseline take-up rate

matched the pattern predicted by the theory, allowing us to rationalize previously �puzzling�

results from the literature.

33

References

Allcott, Hunt. 2011. �Social norms and energy conservation.� Journal of Public Economics,

95(9): 1082�1095.

Allcott, Hunt, and Dmitry Taubinsky. 2015. �Evaluating behaviorally-motivated policy:

Experimental evidence from the lightbulb market.� American Economic Review, forthcom-

ing.

Allcott, Hunt, and Judd B. Kessler. 2015. �The Welfare E�ects of Nudges: A Case

Study of Energy Use Social Comparisons.� NBER Working Paper 21671.

Allcott, Hunt, and Todd Rogers. 2015. �The Short-Run and Long-Run E�ects of Be-

havioral Interventions: Experimental Evidence from Energy Conservation.� American Eco-

nomic Review.

Angeletos, George-Marios, and Alessandro Pavan. 2007. �E�cient use of information

and social value of information.� Econometrica, 75(4): 1103�1142.

Avitabile, Ciro, and Rafael E De Hoyos Navarro. 2015. �The Heterogeneous E�ect

of Information on Student Performance: Evidence from a Randomized Control Trial in

Mexico.� World Bank Policy Research Working Paper, , (7422).

Bagnoli, Mark, and Ted Bergstrom. 2005. �Log-concave probability and its applica-

tions.� Economic theory, 26(2): 445�469.

Beshears, John, James J Choi, David Laibson, Brigitte C Madrian, and Kather-

ine L Milkman. 2015a. �The e�ect of providing peer information on retirement savings

decisions.� The Journal of Finance, 70(3): 1161�1201.

Beshears, John, James J. Choi, David Laibson, Brigitte C. Madrian, and

Sean (Yixiang) Wang. 2015b. �Who Is Easier to Nudge?� Working Paper.

Bettinger, Eric P, Bridget Terry Long, Philip Oreopoulos, and Lisa Sanbon-

matsu. 2012. �The role of application assistance and information in college decisions:

Results from the h&r block fafsa experiment*.� The Quarterly Journal of Economics,

127(3): 1205�1242.

Bhargava, Saurabh, and Dayanand Manoli. 2013. �Why are bene�ts left on the table?

Assessing the role of information, complexity, and stigma on take-up with an IRS �eld

experiment.� Amer. Econ. Rev.

34

Bhargava, Saurabh, and Day Manoli. 2015. �Psychological Frictions and the Incomplete

Take-Up of Social Bene�ts: Evidence from an IRS Field Experiment.� American Economic

Review, 105(11): 1�42.

Bollinger, Bryan, Phillip Leslie, and Alan Sorensen. 2011. �Calorie Posting in Chain

Restaurants.� American Economic Journal: Economic Policy, 3(1): 91�128.

Cai, Hongbin, Yuyu Chen, and Hanming Fang. 2009. �Observational Learning: Evi-

dence from a Randomized Natural Field Experiment.� The American Economic Review,

99(3): 864.

Chambers, Christopher P, and Paul J Healy. 2012. �Updating toward the signal.�

Economic Theory, 50(3): 765�786.

Chen, Yan, F Maxwell Harper, Joseph Konstan, and Sherry Xin Li. 2010. �Social

comparisons and contributions to online communities: A �eld experiment on movielens.�

The American economic review, 1358�1398.

Cialdini, Robert B, Linda J Demaine, Brad J Sagarin, Daniel W Barrett, Kelton

Rhoads, and Patricia L Winter. 2006. �Managing social norms for persuasive impact.�

Social in�uence, 1(1): 3�15.

Cialdini, Robert B, Raymond R Reno, and Carl A Kallgren. 1990. �A focus theory

of normative conduct: recycling the concept of norms to reduce littering in public places.�

Journal of personality and social psychology, 58(6): 1015.

Clark, Robert L, Jennifer A Maki, and Melinda Sandler Morrill. 2014. �Can Sim-

ple Informational Nudges Increase Employee Participation in a 401 (k) Plan?� Southern

Economic Journal, 80(3): 677�701.

Co�man, Lucas C, Clayton R Featherstone, and Judd B Kessler. 2014. �Can Social

Information A�ect What Job You Choose and Keep?� Working Paper.

Cornand, Camille, and Frank Heinemann. 2008. �Optimal degree of public information

dissemination*.� The Economic Journal, 118(528): 718�742.

Croson, Rachel, and Jen Yue Shang. 2008. �The impact of downward social information

on contribution decisions.� Experimental Economics, 11(3): 221�233.

Del Carpio, Lucia. 2013. �Are the neighbors cheating? Evidence from a social norm

experiment on property taxes in Peru.� Work. Pap., Princeton Univ., Princeton, NJ.

35

Denuit, Michel M, and Louis Eeckhoudt. 2010. �A general index of absolute risk atti-

tude.� Management Science, 56(4): 712�715.

Featherstone, Clayton R. 2015. �Prior-Free Bayesian Updating.�

Fellner, Gerlinde, Rupert Sausgruber, and Christian Traxler. 2013. �Testing en-

forcement strategies in the �eld: Threat, moral appeal and social information.� Journal

of the European Economic Association, 11(3): 634�660.

Fischbacher, Urs, Simon Gächter, and Ernst Fehr. 2001. �Are people conditionally

cooperative? Evidence from a public goods experiment.� Economics Letters, 71(3): 397�

404.

Frey, Bruno S, and Stephan Meier. 2004. �Social comparisons and pro-social behav-

ior: Testing �conditional cooperation� in a �eld experiment.� American Economic Review,

1717�1722.

Gerber, Alan S, and Todd Rogers. 2009. �Descriptive social norms and motivation to

vote: Everybody's voting and so should you.� The Journal of Politics, 71(01): 178�191.

Goldstein, Noah J, Robert B Cialdini, and Vladas Griskevicius. 2008. �A room

with a viewpoint: Using social norms to motivate environmental conservation in hotels.�

Journal of consumer Research, 35(3): 472�482.

Hallsworth, Michael, John List, Robert Metcalfe, and Ivo Vlaev. 2014. �The be-

havioralist as tax collector: Using natural �eld experiments to enhance tax compliance.�

National Bureau of Economic Research.

Hastings, Justine, Christopher A Neilson, and Seth D Zimmerman. 2015. �The

e�ects of earnings disclosure on college enrollment decisions.� National Bureau of Economic

Research.

Hastings, Justine S, and Je�rey M Weinstein. 2007. �Information, school choice, and

academic achievement: Evidence from two experiments.� National Bureau of Economic

Research.

Ibragimov, Il'dar Abdullovich. 1956. �On the composition of unimodal distributions.�

Theory of Probability & Its Applications, 1(2): 255�260.

Jensen, Robert. 2010. �The (perceived) returns to education and the demand for school-

ing.� The Quarterly Journal of Economics, 125(2): 515�548.

36

Jessoe, Katrina, and David Rapson. 2014. �Knowledge Is (Less) Power: Experimental

Evidence from Residential Energy Use.� American Economic Review, 104(4): 1417�38.

Kamenica, Emir, and Matthew Gentzkow. 2011. �Bayesian Persuasion.� American

Economic Review, 101(6): 2590�2615.

Keser, Claudia, and Frans Van Winden. 2000. �Conditional cooperation and voluntary

contributions to public goods.� The Scandinavian Journal of Economics, 102(1): 23�39.

Kessler, Judd B, and Alvin E Roth. 2015. �Organ Allocation Policy and the Decision

to Donate.� American Economic Review.

Kimball, Miles. 1992. �,Precautionary Motives for Holding Assets.� In The New Palgrave

Dictionary of Money and Finance. , ed. Peter Newman, Murray Milgate and John Eatwell,

158�161. New York:Stockton Press.

Kimball, Miles S. 1990. �Precautionary Saving in the Small and in the Large.� Economet-

rica: Journal of the Econometric Society, 53�73.

Marshall, Albert W, Ingram Olkin, and Barry Arnold. 2010. Inequalities: theory of

majorization and its applications. Springer Science & Business Media.

Martin, Richard, and John Randal. 2008. �How is donation behaviour a�ected by the

donations of others?� Journal of Economic Behavior & Organization, 67(1): 228�238.

Milgrom, Paul R. 1981. �Good news and bad news: Representation theorems and appli-

cations.� The Bell Journal of Economics, 380�391.

Morris, Stephen, and Hyun Song Shin. 2002. �Social value of public information.� The

American Economic Review, 92(5): 1521�1534.

Nelsen, Roger B. 2013. An introduction to copulas. Vol. 139, Springer Science & Business

Media.

Nguyen, Trang. 2008. �Information, Role Models and Perceived Returns to Education:

Experimental Evidence from Madagascar.� M.I.T. Working Paper.

Pomeranz, Dina. 2015. �Taxation without information.� American Economic Review,

105(8): 2539�69.

Potters, Jan, Martin Sefton, and Lise Vesterlund. 2005. �After you � endogenous

sequencing in voluntary contribution games.� Journal of public Economics, 89(8): 1399�

1419.

37

Potters, Jan, Martin Sefton, and Lise Vesterlund. 2007. �Leading-by-example and

signaling in voluntary contribution games: an experimental study.� Economic Theory,

33(1): 169�182.

Prékopa, András. 1973. �Logarithmic concave measures and functions.� Acta Scientiarum

Mathematicarum, 34(1): 334�343.

Rohatgi, Vijay K, and Gábor J Székely. 1989. �Sharp inequalities between skewness

and kurtosis.� Statistics & probability letters, 8(4): 297�299.

Salganik, Matthew J, Peter Sheridan Dodds, and Duncan J Watts. 2006. �Exper-

imental study of inequality and unpredictability in an arti�cial cultural market.� science,

311(5762): 854�856.

Sancetta, Alessio, and Stephen Satchell. 2004. �The Bernstein copula and its applica-

tions to modeling and approximations of multivariate distributions.� Econometric Theory,

20(03): 535�562.

Schultz, P Wesley, Jessica M Nolan, Robert B Cialdini, Noah J Goldstein, and

Vladas Griskevicius. 2007. �The constructive, destructive, and reconstructive power of

social norms.� Psychological science, 18(5): 429�434.

Shang, Jen, and Rachel Croson. 2009. �A �eld experiment in charitable contribution:

The impact of social information on the voluntary provision of public goods.� The Eco-

nomic Journal, 119(540): 1422�1439.

Slemrod, Joel, Marsha Blumenthal, and Charles Christian. 2001. �Taxpayer response

to an increased probability of audit: evidence from a controlled experiment in Minnesota.�

Journal of Public Economics, 79(3): 455�483.

Sunstein, Cass R., and Richard H. Thaler. 2008. Nudge: Improving Decisions About

Health, Wealth, and Happiness. Penguin Books.

Taubinsky, Dmitry. 2014. �From intentions to actions: A model and experimental evidence

of inattentive choice.� Working Paper.

Vesterlund, Lise. 2003. �The informational value of sequential fundraising.� Journal of

Public Economics, 87(3): 627�657.

Wiswall, Matthew, and Basit Zafar. 2015. �Determinants of College Major Choice:

Identi�cation using an Information Experiment.� Review of Economic Studies, 82(4): 791�

824.

38

Mathematical appendix

A Proofs from Section 2

A.1 Conditional independence (Section 2.1)

Proof of Proposition 1. For a signal where ν possibly depends on both x and y, the posterior

over both x and y is

f (x, y | ν) =f (x, y) · n (ν |x, y)´ ´

f (x, y) · n (ν | x, y) · dx · dy.

Hence, the posterior marginal over x is

f (x | ν) =

´f (x, y) · n (ν |x, y) · dy´ ´f (x, y) · n (ν | x, y) · dx · dy

.

Now, ν ⊥ y | x tells us that the signal distribution doesn't depend on y, that is, n (ν |x, y) =

n (ν |x, y′) for all y, y′. Letting n (ν |x) denote this value then, we can simplify our expression

for the posterior marginal over x to

f (x | ν) =n (ν |x) ·

´f (x, y) · dy´

n (ν | x) ·´f (x, y) · dy · dx

.

Since the prior marginal over x is f (x) =´f (x, y) · dy, we have shown the �rst part of the

theorem. The second part follows directly from the de�nition of conditional independence.

A.2 Information nudge approximation (Section 2.2)

Lemma 1. The posterior in the nudge approximation is

f (x | ν) =

{1 + ε2 ·

[(x− µ) · (ν − µ)− 1

2·[(x− µ)2 − σ2

]]}· f (x) .

Proof. To start, we need to use Bayes' rule to compute the posterior under the information

nudge approximation:

f (x | ν) =f (x) · n (ν |x)´f (x) · n (ν | x) · dx

.

39

The denominator for Bayes' rule is then

ˆn (ν | x) · f (x) · dx ∝

ˆ {1− 1

2· ε2 · (x− ν)2

}· f (x) · dx,

∝ 1− 1

2· ε2 ·

ˆ[(x− µ) + (µ− ν)]2 · f (x) · dx,

∝ 1− 1

2· ε2 ·

[σ2 + (ν − µ)2] ,

where we move from the second line to the third by noting that the 2 · (x− µ) · (µ− ν) term

disappears since µ is de�ned as the prior expectation of x. Hence, Bayes' rule gives our

posterior as

f (x | ν) =1− 1

2· ε2 · (x− ν)2

1− 12· ε2 ·

[σ2 + (ν − µ)2] · f (x) ,

≈{

1 +1

2· ε2 ·

[σ2 + (ν − µ)2 − (x− ν)2]} · f (x) .

=

{1 +

1

2· ε2 ·

[σ2 + (ν − µ)2 − [(x− µ) + (µ− ν)]2

]}· f (x) ,

=

{1 + ε2 ·

[(x− µ) · (ν − µ)− 1

2·[(x− µ)2 − σ2

]]}· f (x) .

where the second line uses the approximation 11+ε2≈ 1−ε2 and ignores terms that are o (ε2),

and the third and fourth put (x− ν)2 in a form where x is directly compared to µ instead

of ν.

Going forward, it will be useful to denote the central moments of the prior by µn ≡E [(x− µ)n].29 Now we proceed by computing the posterior expectation.

Lemma 2. The �rst moment of the posterior, centered about µ, is

E [x | ν] = µ+

[σ2 · (ν − µ)− 1

2· µ3

]· ε2.

Further, all posterior moments centered at µ can be expressed as

E [ψ (x) | ν] = E [ψ (x)] +O(ε2).

29Note that this means µ0 = 1, µ1 = 0, and µ2 = σ2.

40

Proof. For the �rst result, simple computation with the formula from Lemma 1 yields

E [x | ν] =

ˆ {1 + ε2 ·

[(x− µ) · (ν − µ)− 1

2·[(x− µ)2 − σ2

]]}· x · f (x) · dx

= µ+

ˆ {(x− µ) + ε2 ·

[(x− µ)2 · (ν − µ)− 1

2·[(x− µ)3 − σ2 · (x− µ)

]]}· f (x) · dx

= µ+ ε2 ·[σ2 · (ν − µ)− 1

2· µ3

].

The second follows directly from the fact that the posterior distribution is of form f (x) +

O (ε2).

Going forward, it will be easier to compute centralized moments of the posterior taken

around µ, but we will ultimately be more interested in central moments, taken around

E [x | ν]. Fortunately, these are related in a simple way.

Lemma 3. Moments of the posterior centered about µ and E [x | ν] are related by the following

expression:

E [(x− E [x | ν])n | ν] = E [(x− µ)n | ν]− n · µn−1 · σ2 ·[σ2 · (ν − µ)− 1

2· µ3

]· ε2 + o

(ε2).

Proof. If we express E [(x− E [x | ν])n | ν] as E[(

(x− µ)−[σ2 · (ν − µ)− 1

2· µ3

]· ε2)n ∣∣ ν],

all but the �rst two terms of the binomial expansion will be o (ε2). Then,

E [(x− E [x | ν])n | ν] = E [(x− µ)n | ν]− n · E[(x− µ)n−1

∣∣ ν] · [σ2 · (ν − µ)− 1

2· µ3

]· ε2 + o

(ε2),

= E [(x− µ)n | ν]− n · µn−1 ·[σ2 · (ν − µ)− 1

2· µ3

]· ε2 + o

(ε2),

where the second line came from the second part of Lemma 2.

Now, we can derive an express for the central moments of the posterior.

Theorem 11. The posterior central moments (that is, centered around E [x | ν] are:

E [(x− E [x | ν])n | ν]

= µn +

{(µn+1 − n · µn−1 · σ2

)· (ν − µ)− 1

2·[µn+2 − σ2 · µn − n · µ3 · µn−1

]}· ε2

+ o(ε2).

41

Proof. Looking to the formula for the posterior distribution, an arbitrary moment centered

at µ is given by

E [(x− µ)n | ν] = µn + ε2 ·{µn+1 · (ν − µ)− 1

2·[µn+2 − σ2 · µn

]}.

Plugging this into the result from Lemma 3 then yields

E [(x− E [x | ν])n | ν]

= µn+

{µn+1 · (ν − µ)− 1

2·[µn+2 − σ2 · µn

]− n · µn−1 ·

[σ2 · (ν − µ)− 1

2· µ3

]}·ε2+o

(ε2)

= µn+

{(µn+1 − n · µn−1 · σ2

)· (ν − µ)− 1

2·[µn+2 − σ2 · µn − n · µ3 · µn−1

]}·ε2+o

(ε2).

To further simplify this expression under the near-symmetric approximation, simply set all

odd moments equal to zero. To simplify further with the near-normal approximation, note

that for a normal distribution, µn+2 = (n+ 1)!! · σn+2 for all n ≥ 1.30

Finally, we show that Theorem 1 in the main text is a corollary of what we just proved.

Proof of Theorem 1. Lemma 2 and the identity γ1 ≡ µ3

σ3 yields the expression for the posterior

expectation. Plugging n = 2 into the expression from Theorem 11 yields

Var (x | ν) = µ2 +

{µ3 · (ν − µ)− 1

2·[µ4 − σ2 · µ2

]}· ε2 + o

(ε2),

= σ2 +

{σ3 · µ3

σ3· (ν − µ)− 1

2·[σ4 ·

(µ4

σ4− 3 + 3

)− σ4

]}· ε2 + o

(ε2),

= σ2 +

{σ3 · γ1 · (ν − µ)− σ4 ·

(1 +

1

2· γ2

)}· ε2 + o

(ε2).

where the third line comes from remembering that γ2 ≡ µ4

σ4 − 3. The results when γ1 and γ2

are small follow trivially.

30The double factorial is de�ned such that, for even n, (n+ 1)!! = Πn/2i=0 (2i+ 1) . So, for instance, 7!! =

7 · 5 · 3 · 1 = 105.

42

A.3 How the population responds to an information nudge (Sec-

tion 2.5)

Proof of Theorem 2. Taking the expectation of the product mentioned in the text yields

τ = E[t (µ) · σ2 · (ν − µ) · ε2

],

= ε2 · E[σ2]ˆ

t (µ) · (ν − µ) ·m (µ) · dµ,

= ε2 · E[σ2]ˆ

t (µ) ·m (µ) · dµ ·ˆ

(ν − µ) · t (µ) ·m (µ)´t (µ) ·m (µ) · dµ

· dµ,

= ε2 · E[σ2]· E [t (µ)] · (ν − E [µ | θ = µ]) ,

= ε2 · E[σ2]· E [m (θ)] · (ν − E [µ | θ = µ]) ,

where the last two lines establish the result.

B Proofs from Section 3

B.1 Threshold and belief distribution shifters (Section 3.2)

Two lemmas will prove useful.

Lemma 4. If t (θ |λt) and m (µ |λm) are ordered in the MLR sense, then the distribution

of prior expectations among marginal agents, g (µ | θ = µ; λt, λm) ≡ t(µ |λt)·m(µ |λm)´t(µ |λt)·m(µ |λm)·dµ , is

ordered in the MLR sense in both (µ, λm) and (µ, λt).

Proof. The likelihood ratio with respect to (µ, λt),g(µ | θ=µ;λ′t,λm)

g(µ | θ=µ;λt,λm), is given by:

g (µ | θ = µ; λ′t, λm)

g (µ | θ = µ; λ′t, λm)=

t(µ |λ′t)·m(µ |λm)´t(µ |λ′t)·m(µ |λm)·dµt(µ |λt)·m(µ |λm)´t(µ |λt)·m(µ |λm)·dµ

=t (µ |λ′t)t (µ |λt)

·´t (µ |λt) ·m (µ |λm) · dµ´t (µ |λ′t) ·m (µ |λm) · dµ

.

Since the ratio of integrals does not vary with µ, we have shown our result with respect to

(µ, λt). Similar logic shows it with respect to (µ, λm).

Lemma 5. For all x,

• limλt→−∞

G (x | θ = µ; λt, λm) = limλm→−∞

G (x | θ = µ; λt, λm) = 1, and

• limλt→∞

G (x | θ = µ; λt, λm) = limλm→∞

G (x | θ = µ; λt, λm) = 0.

43

Proof. Writing out the de�nition of the conditional distribution functions, we get

G (x | θ = µ; λt, λm) =

´ x t (µ |λt) ·m (µ |λm) · dµ´t (µ |λt) ·m (µ |λm) · dµ

=T (x |λt) ·

´ x t (µ |µ ≤ x; λt) ·m (µ |λm) · dµT (x |λt) ·

´ x t (µ |µ ≤ x; λt) ·m (µ |λm) · dµ+ [1− T (x |λt)] ·´x t (µ |µ ≥ x; λt) ·m (µ |λm) · dµ

.

Clearly, as λt → ∞, this is zero, and when λt → −∞, it is one, since the integrals are all

�nite for �xed value of x. Similar logic works for λm → ±∞.

Now, we can prove the assertion made in Footnote 18 of the main text.

Lemma 6. Given the assumptions of Section 3.2, the following are true:

• E [µ | θ = µ; λt, λm] is increasing in λt, and its range for any �xed λm is (−∞,∞).

• E [µ | θ = µ; λt, λm] is increasing in λm, and its range for any �xed λt is (−∞,∞).

Proof of Lemma 6. Lemma 4 immediately implies the �rst �half� of both parts, since µ is an

increasing function. For second �half� of Part 1, note that E [µ | θ = µ; λt, λm] can be written

as an integral of the conditional distribution function, that is,

E [µ | θ = µ; λt, λm] =

ˆ ∞0

[1−G (x | θ = µ; λt, λm)] · dx−ˆ 0

−∞G (x | θ = µ; λt, λm) · dx.

Given this, the result on the range of E [µ | θ = µ; λt, λm] with respect to either λm or λt is

a straightforward implication of Lemma 5.

B.2 Shifter comparative statics (Sections 3.2.1 - 3.2.2)

Before our next set of claims, we prove two useful lemmas.

Lemma 7. The following are true:

1. If limµ→±∞

ψ (µ) exists, then limλt→±∞

´t (µ |λt) · ψ (µ) · dµ = lim

µ→±∞ψ (µ).

2. If limµ→±∞

ψ (µ) exists, then limλm→±∞

´m (µ |λm) · ψ (µ) · dµ = lim

µ→±∞ψ (µ).

Proof. For the �−∞� piece of Part 1, note that for any value of x, our integral can also be

expressed as

T (x |λt) ·ˆ x

t (µ | µ ≤ x; λt) · ψ (µ) · dµ+ [1− T (x |λt)] ·ˆx

t (µ | µ ≥ x; λt) · ψ (µ) · dµ,

44

Clearly then, for any x, the following bounds must hold:

T (x |λt) ·minµ≤x

ψ (µ) + [1− T (x |λt)] ·minµ≥x

ψ (µ)

≤ˆt (µ |λt) · ψ (µ) · dµ ≤

T (x |λt) ·maxµ≤x

ψ (µ) + [1− T (x |λt)] ·maxµ≥x

ψ (µ) .

Taking the limit of these bounds as λt → −∞, we get minµ≤x

ψ (µ) ≤ limm→−∞

´t (µ |λt) · ψ (µ) ·

dµ ≤ maxµ≤x

ψ (µ). If we then take the limit as x→ −∞, we get limλt→−∞

´t (µ |λt) ·ψ (µ) ·dµ =

limµ→−∞

ψ (µ). The same logic with λt, x→∞ yields the �+∞� piece of Part 1. Part 2 follows

by identical logic.

Lemma 8. If f (x) is log-concave on some domain, then on that domain, it can only have

one maximum.

Proof. Log-concavity implies that ∂2 log f∂x2 < 0 and f > 0. Together, these imply that f ′′ < f ′2

f.

At an optimum, f ′ = 0, which combined with f ′′ < f ′2

fmeans that f ′′ < 0, the second-order

condition for a local maximum. And if all optima are maxima, then there can only be one

on the range.

Proof of Proposition 2. For Part 1, it will be easiest to deal with the treatment e�ect in the

form τ = ε2 · E [σ2] ·´t (µ |λt) · m (µ) · (ν − µ) · dµ: simply apply Lemma 7 with ψ (µ) =

m (µ) · (ν − µ). Since´m (µ) · dµ = 1 implies that the tail of m (µ) is o (1/ |µ|), we know

that limµ→±∞

m (µ) · (ν − µ) = 0. It then follows that limλt→±∞

´t (µ |λt) ·m (µ) · (ν − µ) · dµ = 0,

and hence limλt→±∞

τ (λt) = 0. Part 2 is obvious given the form of τ in Theorem 2: the sign of τ

is the same as ν − θ. To show Part 3, again think about the form of τ in Theorem 2. Recall

that E [t (µ |λt)] ≡´t (µ |λt) · m (µ) · dµ. Clearly, the product of log-concave functions is

log-concave, so the integrand is log-concave in (µ, λt). Prékopa (1973) then tells us that the

marginals of log-concave functions are log-concave with respect to the remaining variables;

hence, E [t (µ |λt)] is log-concave in λt. Now, since ν−λt is log-concave whenever λt < ν,31 we

have established that τ (λt) is log-concave on that domain. Since limλt→−∞

τ (λt) = τ (ν) = 0,

there must be a local maximum on λt < ν. By Lemma 8, this is the unique maximum.

Similar logic applied to −τ (λt) on the domain λt > ν gives us a unique minimum on that

domain.

The proof of Proposition 3 uses identical logic.

31To see this, note that log (ν − λt) only exists when λt < ν, and when it does, ∂2

∂λ2t

log (ν − λt) =

− 1(ν−λt)

2 < 0.

45

B.3 Proxying for shifters with matriculation (Section 3.3)

Lemma 9. The following results are true:

• β (λt, λm) is increasing in λm and decreasing in λt.

• If the domain of (λt, λm) is (−∞,∞)× (−∞,∞), then the range of β (λt, λm) is [0, 1].

Proof of Lemma 9. To see the �rst part, recall that the MLR ordering implies the �rst-order

stochastic ordering, which is equivalent to T2 (µ |λt) < 0 andM2 (µ |λm) < 0. For the second

part, note that the matriculation integral obeys the bounds

0 ·ˆ x

m (µ) · dµ+ T (x |λt) ·ˆx

m (µ) · dµ

≤ β (λt) ≤

T (x |λt) ·ˆ x

m (µ) · dµ+ 1 ·ˆx

m (µ) · dµ.

As λt → −∞, the bounds become´xm (µ) ·dµ ≤

´T (µ |λt) ·m (µ) ·dµ ≤

´m (µ) ·dµ, which

as x→ −∞, implies via the squeeze theorem that limλt→−∞

β (λt) = 1. As λt →∞, the bounds

become 0 ≤´T (µ |λt) ·m (µ) · dµ ≤

´xm (µ) · dµ, which as x→∞, implies lim

λt→∞β (λt) = 0.

Similar logic shows a similar result for the λm shifter.

Theorem 3 is a result of plugging Lemma 9 into Theorems 2 and 3.

C Normal model (Section 4.1)

As described qualitatively in the main text, we will assume that µ and θ are distributed

according to density functions

m (µ) = ϕ (µ− E [µ]) ,

t (θ |λt) =1

η· ϕ(θ − E [θ |λt]

η

).

where ϕ is the density of the standard normal, and E [θ |λt] ≡ (1 + η2) · λt − η2 · E [µ].

Our de�nition for E [θ |λt] is without loss of generality, and it ultimately leads to λt =

E [µ | θ = µ; λt], as in the model in the main text (we will show this in the following proof).

Also, note that scaling all quantities up or down doesn't change the model, so we are free

to normalize Var [µ] to 1. We will derive things in terms of η2 ≡ Var[µ]Var[θ]

, but the choice or

normalization will come back towards the end of the following proof.

46

Proof of Theorem 4. We start by calculating the matriculation e�ect. Since µ and θ are

independent and normally distributed, their di�erence is distributed according to θ − µ ∼1√

1+η2·ϕ(

(θ−µ)−(E[θ |λt]−E[µ])√1+η2

), which, since agents join when θ−µ ≤ 0, means that β (λt) =

Φ

(E[µ]−E[θ |λt]√

1+η2

)= Φ

[√1 + η2 · (E [µ]− λt)

]. Inverting, we can express λt in terms of β as

λt (β) = E [µ]− 1√1+η2· Φ−1 (β).

Now, we move on to calculate the treatment e�ect. To do so, we simplify the expression

t (µ |λt) ·m (µ) by using the following identity concerning the product of two normals:

ϕ

(µ− ab

)· ϕ(µ− cd

)= ϕ

(a− c√b2 + d2

)· ϕ

(µ− a·d2+c·b2

b2+d2

b·d√b2+d2

).

Doing so gives us,

t (µ |λt) ·m (µ) =1

η· ϕ

(E [θ |λt]− E [µ]√

1 + η2

)· ϕ

µ− E[θ |λt]+E[µ]·η2

1+η2

η√1+η2

,

=1√

1 + η2· ϕ

λt − E [µ]1√

1+η2

· 1η√

1+η2

· ϕ

µ− λtη√

1+η2

,

where the second line comes from plugging in E [θ |λt] = (1 + η2) · λt − η2 · E [µ]. Using this

expression, it is easy to show that

E [t (µ |λt)] =

ˆt (µ |λt) ·m (µ) · dµ =

1√1 + η2

· ϕ

λt − E [µ]1√

1+η2

,

E [µ | θ = µ; λt] =

´t (µ |λt) ·m (µ) · µ · dµ´t (µ |λt) ·m (µ) · dµ

= λt,

where the second line makes good on our promise from earlier. Hence,

τ (λt) = ε2 · E[σ2]· 1√

1 + η2· ϕ

λt − E [µ]1√

1+η2

· (ν − λt) ,τ (β) = ε2 · E

[σ2]· 1√

1 + η2· ϕ(Φ−1 (β)

)·

(z (ν) +

1√1 + η2

· Φ−1 (β)

),

where the second line came from plugging in λt (β) = E [µ] − 1√1+η2

· Φ−1 (β), using the

fact that ϕ (−x) = ϕ (x), and remembering that since we normalized to make the standard

47

deviation of the belief distribution equal to one, ν −E [µ] equals the z-score of ν normalized

against the belief distribution, which we denote as z (ν).

Proof of Theorem 4. From this formula, we can easily derive the β0 from Theorems 3 to be

β0 = Φ(−z (ν) ·

√1 + η2

).

Since this is the only zero of τ (β), for any β 6= β0 the treatment e�ect has local optima

wherever

1

τ (β)· ∂τ (β)

∂β=

−Φ−1 (β) +1√

1 + η2· 1

z (ν) + 1√1+η2· Φ−1 (β)

· ∂Φ−1 (β)

∂β= 0,

where we used the fact that ϕ′(x)ϕ(x)

= −x to simplify. Making the term in curly brackets equal

to zero is a simple matter of solving a quadratic in Φ−1 (β) whose solutions are

β± = Φ

(1

2· Φ−1 (β0)±

√1

4· [Φ−1 (β0)]2 + 1

),

= Φ

−1

2· z (ν) ·

√1 + η2 ±

√[1

2· z (ν) ·

√1 + η2

]2

+ 1

.

Clearly, the β+ is the maximum, while β− is the minimum.

Proposition 6. In the normal model, when the signal is �good news�, that is, z(ν) > 0, we

know that, β0 ∈ (0,Φ (−z (ν))), which means that

• β+ ∈(

12,Φ

(−1

2· z (ν) +

√(12· z (ν)

)2+ 1

)),

• β− ∈(

0,Φ

(−1

2· z (ν)−

√(12· z (ν)

)2+ 1

)),

• τ (β) is increasing on β ∈(

Φ

(−1

2· z (ν)−

√(12· z (ν)

)2+ 1

), 1

2

),

• τ (β) is decreasing on β ∈(

Φ

(−1

2· z (ν) +

√(12· z (ν)

)2+ 1

), 1

).

Similarly, when the signal is �bad news�, that is, z(ν) < 0, we know that β0 ∈ (Φ (−z (ν)) , 1),

which means that

• β+ ∈(

Φ

(−1

2· z (ν) +

√(12· z (ν)

)2+ 1

), 1

),

48

• β− ∈(

Φ

(−1

2· z (ν)−

√(12· z (ν)

)2+ 1

), 1

2

),

• τ (β) is decreasing on β ∈(

0,Φ

(−1

2· z (ν)−

√(12· z (ν)

)2+ 1

)),

• τ (β) is increasing on β ∈(

12,Φ

(−1

2· z (ν) +

√(12· z (ν)

)2+ 1

)).

Proof. If the signal is �good news�, i.e. z (ν) > 0, then the formula β0 = Φ(−√

1 + η2 · z (ν))

makes it clear that β0 ∈ (0,Φ (−z (ν))). Now, the derivative of the extrema locations is

∂β±∂β0

= ϕ

1

2· Φ−1 (β0)±

√(1

2· Φ−1 (β0)

)2

+ 1

·12·

1±12· Φ−1 (β0)√(

12· Φ−1 (β0)

)2+ 1

·∂Φ−1

∂β(β0) ,

which is always positive. Hence we can plug in the domain of β0 to get the range of β±.

Doing so yields the ranges in the theorem. To understand the ranges on which τ (β) is

increasing and decreasing, note that the curve is necessarily decreasing on β ∈ (0, β−),

increasing on β ∈ (β−, β+), and decreasing on β ∈ (β+, 1). This means that the curve is

always decreasing on β ∈(

0,minη{β−}

), increasing on β ∈

(maxη{β−} ,min

η{β+}

), and

decreasing on β ∈(

maxη{β+} , 1

), which gives the ranges in the theorem. We have only laid

out the rationale for the �good news� case in the theorem, but the �bad news� case follows

from identical logic.

Proof of Proposition 4. To derive the ranges for β±, we simply �nd the z (ν) that makes the

range from Proposition 4 the largest. The ranges of monotonicity follow for the same reasons

described in the proof of Proposition 4.

Proof of Theorem 5. The intervals that de�ne Λ (β | η) and Υ (β | η) follow trivially from theshape of the treatment e�ect curve and the de�nition of b (η). To sign b′ (η), we start bytaking the derivative of the log of its de�nition, log [τ (b (η) | η)] = log [τ (β | η)]:32

∂τ∂β

(b (η)) · b′ (η)− η1+η2

·[τ (b (η) | η) + ε2 · E

[σ2]· 1

1+η2· ϕ(Φ−1 (b (η))

)· Φ−1 (b (η))

]τ (b (η) | η)

=− η

1+η2·[τ (β | η) + ε2 · E

[σ2]· 1

1+η2· ϕ(Φ−1 (β)

)· Φ−1 (β)

]τ (β | η)

,

32This is allowed even when the argument of the log is negative. To see this, assume x < 0, and note thatlog (x) = log

(|x| · e−π·i

)= log (|x|) + log

(e−π·i

)= log (|x|)− π · i.

49

which simpli�es to

∂τ∂β

(b (η)) · b′ (η)

τ (b (η) | η)−

η

(1+η2)3/2 · Φ−1 (b (η))

z (ν) + 1√1+η2· Φ−1 (b (η))

= −η

(1+η2)3/2 · Φ−1 (β)

z (ν) + 1√1+η2· Φ−1 (β)

,

which in turn simpli�es to

b′ (η) =η

(1 + η2)3/2·τ (b (η) | η)

τ ′ (b (η))·

Φ−1 (b (η)) ·

[z (ν) + 1√

1+η2· Φ−1 (β)

]− Φ−1 (β) ·

[z (ν) + 1√

1+η2· Φ−1 (b (η))

][z (ν) + 1√

1+η2· Φ−1 (b (η))

]·[z (ν) + 1√

1+η2· Φ−1 (β)

] ,

=

η

(1+η2)3/2· τ (b (η) | η) · z (ν)[

z (ν) + 1√1+η2

· Φ−1 (b (η))

]·[z (ν) + 1√

1+η2· Φ−1 (β)

] · 1

τ ′ (b (η))·{

Φ−1 (b (η))− Φ−1 (β)},

=

{η

(1 + η2)5/2· ε4 ·

(E[σ2])2 · ϕ (Φ−1 (β)

)· ϕ(Φ−1 (b (η))

)·}·z (ν)

τ (β | η)·

1

τ1 (b (η) | η)·{

Φ−1 (b (η))− Φ−1 (β)}.

Now, the �rst term is positive, regardless. For both parts of the proposition, z (ν) and τ (β | η)

have the same sign, and τ1 (β | η) is negative (which means that τ1 (b (η) | η) is positive).

Hence, for both cases considered in the proposition,

sgn {b′ (η)} = sgn{

Φ−1 (b (η))− Φ−1 (β)},

= sgn {b (η)− β} .

The assertions about the sign of b′ (η) in the proposition follow directly from this and the

general shape of the treatment e�ect curve.

Now, to establish that setting η = 0 gives us an inner bound, we just have to show that β

ful�lling our assumptions doesn't also tell us that η has a lower bound greater than zero.

For the �rst part, note that being on the positive and decreasing part of τ (β) puts β ≥ β+,

and hence β ≥ 12by Proposition 4. If β ≥ 1

2, then Φ−1 (β) ≥ 0, which means that τ (β) ≥ 0

for any η > 0, since the sign of τ is the same as the sign of z (ν) + 1√1+η2· Φ−1 (β). For the

second part, being negative and decreasing puts β ≤ β−, and hence β ≤ 12by Proposition 4.

Similar logic shows that τ (β) ≤ 0 for any η > 0.

50

D Proofs from Section 6

D.1 Attrition (Section 6.2)

Proof of Theorem 7. Integrating the product mentioned in the main text over µ and σ, weget

ˆT (µ) · tN (µ) · σ2 · (ν − µ) · ε2 ·m (µ) · s

(σ2)· dµ · dσ2

= E[σ2]·{ˆ

tN (µ) · T (µ) ·m (µ) · dµ}·{ ´

tN (µ) · T (µ) ·m (µ) · (ν − µ) · dµ´tN (µ) · T (µ) ·m (µ) · dµ

}· ε2

= E[σ2]·{ˆ

T (µ) ·m (µ) · dµ}·{ ´

tN (µ) · T (µ) ·m (µ) · dµ´T (µ) ·m (µ) · dµ

}·(t− E

[µ∣∣∣ θ ≤ µ = θN

])· ε2

= E[σ2]· β · E

[tN (µ)

∣∣∣µ ≥ θ] · (t− E[µ∣∣∣ θ ≤ µ = θN

])· ε2.

Dividing through by the baseline matriculation rate gives the e�ect on attrition, which we

is what we sought to prove.

Now we compute this e�ect in the normal model.

Proof of Proposition 5. Let µ ∼ Φ (µ− E [µ]) and θ, θN ∼ Φ(θ−E[θ |λt]

η

), independently,

where E [θ |λt] ≡ (1 + η2) · λt − η2 · E [µ]. In the proof of Theorem 4, we already showed

that β (λt) = Φ[√

1 + η2 · (E [µ]− λt)], and hence λt (β) = E [µ] − 1√

1+η2· Φ−1 (β). So,

E [θ | β] ≡ E [µ]−√

1 + η2 · Φ−1 (β), and hence θ, θN ∼ Φ

(θ−(E[µ]−√

1+η2·Φ−1(β))

η

). Now,

E[µ∣∣∣ θ ≤ µ = θN

]=

´µ · ϕ (µ− E [µ]) · Φ

(µ−(E[µ]−

√1+η2·Φ−1(β)

)η

)· 1η· ϕ(µ−(E[µ]−

√1+η2·Φ−1(β)

)η

)· dµ

´ϕ (µ− E [µ]) · Φ

(µ−(E[µ]−

√1+η2·Φ−1(β)

)η

)· 1η· ϕ(µ−(E[µ]−

√1+η2·Φ−1(β)

)η

)· dµ

.

To start this computation, we recall our identity from the proof of Theorem 4,

ϕ

(µ− ab

)· ϕ(µ− cd

)= ϕ

(a− c√b2 + d2

)· ϕ

(µ− a·d2+c·b2

b2+d2

b·d√b2+d2

),

which allows us to write

ϕ (µ− E [µ]) · ϕ

µ−(E [µ]−

√1 + η2 · Φ−1 (β)

)η

= ϕ

E [µ]−(E [µ]−

√1 + η2 · Φ−1 (β)

)√

1 + η2

· ϕµ−

E[µ]·η2+(E[µ]−

√1+η2·Φ−1(β)

)1+η2

η√1+η2

,

= ϕ(Φ−1 (β)

)· ϕ(µ ·√

1 + η2 − E [µ] ·√

1 + η2 + Φ−1 (β)

η

).

51

Now, we can write

ˆΦ

µ−(E [µ]−

√1 + η2 · Φ−1 (β)

)η

· ϕ (µ− E [µ]) ·1

η· ϕ

µ−(E [µ]−

√1 + η2 · Φ−1 (β)

)η

· dµ= ϕ

(Φ−1 (β)

)·ˆ

Φ

µ−(E [µ]−

√1 + η2 · Φ−1 (β)

)η

· 1

η· ϕ(µ ·√

1 + η2 − E [µ] ·√

1 + η2 + Φ−1 (β)

η

)· dµ

= ϕ(Φ−1 (β)

)·ˆ

Φ

(

η√1+η2

· x+ E [µ]− Φ−1(β)√1+η2

)−(E [µ]−

√1 + η2 · Φ−1 (β)

)η

· ϕ (x) ·1

η·

η√1 + η2

· dx

= ϕ(Φ−1 (β)

)·ˆ

Φ

(x√

1 + η2+

η√1 + η2

· Φ−1 (β)

)· ϕ (x) ·

1√1 + η2

· dx

=1√

1 + η2· ϕ(Φ−1 (β)

)· Φ(

η · Φ−1 (β)√(1 + η2) · (2 + η2)

),

where we use the Gaussian integral identity´

Φ (a+ b · x) · ϕ (x) · dx = Φ(

a√1+b2

), and

ˆµ · ϕ (µ− E [µ]) · Φ

µ−(E [µ]−

√1 + η2 · Φ−1 (β)

)η

· 1

η· ϕ

µ−(E [µ]−

√1 + η2 · Φ−1 (β)

)η

· dµ= ϕ

(Φ−1 (β)

)·ˆ (

η√1 + η2

· x+ E [µ]−Φ−1 (β)√

1 + η2

)· Φ(

x√1 + η2

+η√

1 + η2· Φ−1 (β)

)· ϕ (x) ·

1√1 + η2

· dx

= ϕ(Φ−1 (β)

)·ˆ (

η√1 + η2

· x)· Φ(

x√1 + η2

+η√

1 + η2· Φ−1 (β)

)· ϕ (x) ·

1√1 + η2

· dx

+ ϕ(Φ−1 (β)

)·(E [µ]−

Φ−1 (β)√1 + η2

)·ˆ

Φ

(x√

1 + η2+

η√1 + η2

· Φ−1 (β)

)· ϕ (x) ·

1√1 + η2

· dx

=η

1 + η2· ϕ(Φ−1 (β)

)·ˆx · Φ

(x√

1 + η2+

η√1 + η2

· Φ−1 (β)

)· ϕ (x) · dx

+

(E [µ]−

Φ−1 (β)√1 + η2

)·

1√1 + η2

· ϕ(Φ−1 (β)

)· Φ(

η · Φ−1 (β)√(1 + η2) · (2 + η2)

)

=1√

1 + η2· ϕ(Φ−1 (β)

)·{

η√(1 + η2) · (2 + η2)

· ϕ(η · Φ−1 (β)√

2 + η2

)+

(E [µ]−

Φ−1 (β)√1 + η2

)· Φ(

η · Φ−1 (β)√(1 + η2) · (2 + η2)

)}

where we use the Gaussian integral identity´x ·Φ (a+ b · x) ·ϕ (x) · dx = b√

1+b2·ϕ(

a√1+b2

).

So,

E[µ∣∣∣ θ ≤ µ = θN

]

=

1√1+η2

· ϕ(Φ−1 (β)

)·{

η√(1+η2)·(2+η2)

· ϕ(η·Φ−1(β)√

2+η2

)+

(E [µ]− Φ−1(β)√

1+η2

)· Φ(

η·Φ−1(β)√(1+η2)·(2+η2)

)}1√

1+η2· ϕ (Φ−1 (β)) · Φ

(η·Φ−1(β)√

(1+η2)·(2+η2)

),

= E [µ]−1√

1 + η2·

Φ−1 (β)−η√

2 + η2·

ϕ

(η√

2+η2· Φ−1 (β)

)Φ

(η√

(1+η2)·(2+η2)· Φ−1 (β)

),

.

52

Now, we can compute the treatment e�ect on attrition rate

τA = E[σ2]· E[tN (µ)

∣∣∣µ ≥ θ] · (t− E[µ∣∣∣ θ ≤ µ = θN

])· ε2,

= ε2 · E[σ2]·

1√1 + η2

· ϕ(Φ−1 (β)

)· Φ(

η · Φ−1 (β)√(1 + η2) · (2 + η2)

)

·

z (ν) +1√

1 + η2· Φ−1 (β)−

η√(1 + η2) · (2 + η2)

·ϕ

(η√

2+η2· Φ−1 (β)

)Φ

(η√

(1+η2)·(2+η2)· Φ−1 (β)

),

.

This is what we sought to prove.

Comparing the treatment e�ect on attrition rate to the treatment e�ect on matriculation

rate, we see that while the latter is positive when z (ν) ≥ E [µ | θ = µ] = − 1√1+η2· Φ−1 (β),

the former is positive when z (ν) ≥ E[µ∣∣ θ ≤ µ = θN

]= E [µ | θ = µ] + η√

(1+η2)·(2+η2)·

ϕ

(η√

2+η2·Φ−1(β)

)Φ

(η√

(1+η2)·(2+η2)·Φ−1(β)

),

. Since this extra term is positive, it represents the width of the

range of z (ν) such that we expect positive treatment e�ect on matriculation and negative

treatment e�ect on attrition rate.

D.2 Nudges that reassure (Section 6.3)

Proof of Theorem 8. Recall from Sections 2.3 and 2.5 that untreated agents join when E [x]−12·α1·E

[(x− E [µ])2] ≥ θ, while treated agents join when E [µ | ν]− 1

2·α1·E

[(x− E [µ])2

∣∣ ν] ≥θ. Hence, the width of the range of thresholds that are persuaded by the nudge is

E [x | ν]− E [x]− 1

2· α1 ·

{E[(x− E [µ])2

∣∣ ν]− E[(x− E [µ])2]}

= σ2 ·{

(ν − µ)− 1

2· σ · γ1

}· ε2 − 1

2· α1 ·

{E[(x− E [x | ν])2

∣∣ ν]− E[(x− µ)2]

+ (E [x | ν]− E [µ])2 − (µ− E [µ])2} .Expanding out the components of the α1 term, we get

E[(x− E [x | ν])2

∣∣ ν]− E[(x− µ)2] = −σ2 ·

[σ2 ·

(1 +

1

2· γ2

)− σ · γ1 · (ν − µ)

]· ε2,

= −σ4 · ε2 − 1

2· σ4 · γ2 · ε2 + σ3 · γ1 · (ν − µ) · ε2,

= −σ4 · ε2 +O(γ2 · ε2

)+O

(γ1 · (ν − µ) · ε2

),

53

and

(E [x | ν]− E [µ])2−(µ− E [µ])2 =

((µ− E [µ]) + σ2 ·

{(ν − µ)− 1

2· σ · γ1

}· ε2

)2

−(µ− E [µ])2

= 2 · (µ− E [µ]) · σ2 ·{

(ν − µ)− 1

2· σ · γ1

}· ε2 + σ4 ·

{(ν − µ)− 1

2· σ · γ1

}2

· ε4

= O((ν − µ) · ε2

)+O

(γ1 · ε2

)+O

((ν − µ) · ε4

)+O

(γ1 · ε4

).

Hence, the width of the range of thresholds is

E [x | ν]− E [x]− 1

2· α1 ·

{E[(x− E [µ])2

∣∣ ν]− E[(x− E [µ])2]}

= σ2 ·{

(ν − µ)− 1

2· σ · γ1

}· ε2 +

1

2· σ4 · α1 · ε2 + o

(α1 · ε2

).

We will multiply this by the density of thresholds, evaluated at E [x]− 12·α1·E

[(x− E [µ])2] =

µ+O (α1), which is equal to t (µ)+O (α1). To leading order, then, the product of the density

and the range is t (µ)·[σ2 ·

{(ν − µ)− 1

2· σ · γ1

}+ 1

2· σ4 · α1

]·ε2.Taking the expectation gives

τ = E [t (µ)] ·{E[σ2]· (ν − E [µ | θ = µ])− 1

2· E[σ3]· E [γ1] +

1

2· E[σ4]· E [α1]

}· ε2,

which is what we sought to show.

D.3 Continuous choices (Section 6.4)

Proof of Theorem 9. Let a∗ (ε2) be the maximizer of the optimization from the main text.

The �rst-order conditions for this optimization require

a∗(ε2)

= a0 −u1 (a0,E [µ]) + u12 (a0,E [µ]) · [µ− E [µ] + σ2 · (ν − µ) · ε2]

u11 (a0,E [µ]).

To leading order then, a∗ will move by

ε2 · ∂a∗

∂ε2(0) = −u12 (a0,E [µ])

u11 (a0,E [µ])· σ2 · (ν − µ) · ε2.

Taking the expectation across the population yields our result.

54

D.4 Strategic signal revelation (Section 6.5)

Proof of Theorem 10. First, we need to calculate how agents will respond to being told that

ν is in partition element Π. De�ne ω (Π | ν0) ≡´

Πn (ν | ν0) · dν. Then,

f (x |Π) =f (x) · ω (Π |x)´f (x) · ω (Π | x) · dx

=f (x) ·

{ω (Π | ν0) + ω2 (Π | ν0) · (x− ν0) · ε+ 1

2· ω22 (Π | ν0) · (x− ν0)2 · ε2

}´f (x) ·

{ω (Π | ν0) + ω2 (Π | ν0) · (x− ν0) · ε+ 1

2· ω22 (Π | ν0) · (x− ν0)2 · ε2

}· dx

=1 + ε · ω2(Π | ν0)

ω(Π | ν0)· (x− ν0) + 1

2· ε2 · ω22(Π | ν0)

ω(Π | ν0)· (x− ν0)2

1 + ε · ω2(Π | ν0)ω(Π | ν0)

· (µ− ν0) + 12· ε2 · ω22(Π | ν0)

ω(Π | ν0)·[σ2 + (µ− ν0)2] · f (x) ,

The natural place to center the expansion is at the maximum likelihood estimate of x given

Π, that is, νΠ = arg maxν0

´Πn (ν | ν0) · dν. The �rst-order conditions for this optimization

dictate

ˆΠ

n2 (ν | νΠ) · dν = 0,

=

ˆΠ

[n2 (ν | ν) + ε · n22 (ν | ν) · (νΠ − ν)] · dν = 0,

ε ·(νΠ ·ˆ

Π

n22 (ν | ν) · dν −ˆ

Π

ν · n22 (ν | ν) · dν)

= 0,

νΠ =

´Πν · n22 (ν | ν) · dν´

Πn22 (ν | ν) · dν

.

Setting ν0 = νΠ, we get

f (x |Π) =

[1 +

1

2· ε2 · ω22 (Π | νΠ)

ω (Π | νΠ)· (x− νΠ)2

]·[1− ε2 · 1

2· ω22 (Π | νΠ)

ω (Π | νΠ)·[σ2 + (µ− νΠ)2]]·f (x)

=

{1− 1

2· ε2 · ω22 (Π | νΠ)

ω (Π | νΠ)·([σ2 + (µ− νΠ)2]− (x− νΠ)2)} · f (x) .

At this point, the parallel with Lemma 1 is clear: agents will treat ν ∈ Π equivalently to

ν = νΠ. This means that

τ (Π) = −ε2 · E [t (µ)] · E[σ2]· ω22 (Π | νΠ)

ω (Π | νΠ)· (νΠ − E [µ | θ = µ]) .

55

A Model of Information Nudges - Stanford University · 2016-04-26 · an implicit assumption that a...

Documents

Transcript of A Model of Information Nudges - Stanford University · 2016-04-26 · an implicit assumption that a...