A Model of Information Nudges - Stanford University · 2016-04-26 · an implicit assumption that a...
A Model of Information Nudges
Lucas Coffman∗ Clayton R. Featherstone† Judd B. Kessler†
November 24, 2015
Abstract
A growing empirical literature has demonstrated that providing decision-makers
with information (e.g. about the actions of others or the returns to different actions)
can affect behavior. However, the literature lacks a theory that can explain when such
interventions will have a large effect or even the sign of the effect. We introduce such
a theory, based on simple Bayesian updating in a setting of binary choice. It yields the
following intuitive insight: the sign of the effect depends on whether the intervention
causes the marginal agent to update her belief up or down. Further, the magnitude of
the effect depends on both the density of agents at the margin and how much those
agents' beliefs move when treated. We also show that when it is prohibitively costly or
impossible to directly measure the beliefs of marginal agents, we can proxy for these
beliefs with the fraction of agents taking the action in the uninformed group. Utilizing
this intuition, our model makes a strong prediction about how treatment effect sign and
magnitude will vary with the proportion taking the action in the control group. Our
model reasonably rationalizes results from the literature: we perform a meta-analysis
of informational nudges and find that, even across very different experimental settings,
the magnitude of the treatment effect varies in a way our theory predicts (note: more
data to come soon).
∗Department of Economics, The Ohio State University
†The Wharton School, University of Pennsylvania
1 Introduction
In recent years, a large and growing body of empirical work has investigated the role of
"nudges" on behavior.1 As this literature has blossomed, researchers have investigated the
efficacy of a number of prominent nudges across a wide variety of settings. One surprise from
the broad application of these interventions is that some of the most empirically robust and
behaviorally intuitive nudges sometimes fail to influence behavior or to influence behavior in
the opposite direction than expected. These results are seen as surprising since many hold
an implicit assumption that a nudge that works in one context will work in all other contexts
rather than relying on theory to assess when a nudge is likely to be effective.
This paper introduces such a theory for one of the most popular nudges: providing
individuals with information about a choice they face. We develop a Bayesian updating
model that treats the information as a signal that leads individuals to update about the
utility they receive from one of the outcomes. This information could be social information
(e.g. information that the majority of other people donate to a charity) or information about
the costs or benefits of taking actions (e.g. information about the returns to graduating from
high school).2 Providing individuals with information has successfully changed behavior in
a wide variety of contexts.3 However, a number of prominent empirical papers have found
null results from providing information (see, e.g., Allcott and Taubinsky (2015), Avitabile
and De Hoyos Navarro (2015), Bettinger et al. (2012), Hastings, Neilson and Zimmerman
(2015), Slemrod, Blumenthal and Christian (2001)) or found treatment effects in the opposite
direction than expected for at least some groups (see, e.g., Fellner, Sausgruber and Traxler
(2013), Bhargava and Manoli (2013), Beshears et al. (2015a)).4 In the absence of a model,
1 See Sunstein and Thaler (2008) for a detailed discussion of nudges. This work has been influential in the policy domain, spawning nudge units in the U.K. (called the Behavioral Insights Team), U.S. (called the Social and Behavioral Sciences Team), and around the world.
2 See Vesterlund (2003) for a model of how sequential fund-raising can allow potential donors to provide information to one another about the quality of a charity. Our model is in this spirit and inspired by Vesterlund (2003) but considers a general information structure.
3 Information about others' decisions has affected decisions to donate money (see, e.g., Frey and Meier (2004), Martin and Randal (2008), Croson and Shang (2008), Shang and Croson (2009)), rate movies (Chen et al. (2010)), order certain entrées (Cai, Chen and Fang (2009)), save energy (Allcott (2011)), reuse towels (Goldstein, Cialdini and Griskevicius (2008)), pay taxes (Hallsworth et al. (2014)), like certain songs (Salganik, Dodds and Watts (2006)), steal petrified wood (Cialdini et al. (2006)), intend to vote (Gerber and Rogers (2009)), litter (Cialdini, Reno and Kallgren (1990)), take a job (Coffman, Featherstone and Kessler (2014)), and give money in laboratory public goods games (Keser and Van Winden (2000), Fischbacher, Gächter and Fehr (2001), and Potters, Sefton and Vesterlund (2005)). Information about the costs or benefits of different actions has been shown to affect school choice (Hastings and Weinstein (2007)), standardized test scores (Nguyen (2008)), graduation rates (Jensen (2010)), claiming tax benefits (Bhargava and Manoli (2013)), tax compliance (Pomeranz (2015)), 401k contribution levels (Clark, Maki and Morrill (2014)), eating fewer calories (Bollinger, Leslie and Sorensen (2011)), responding to energy price changes (Jessoe and Rapson (2014)), and purchasing fluorescent light bulbs (Allcott and Taubinsky (2015)), among many others.
4While information interventions are the focus of this paper, nudges in other domains may also fail to
many papers attempt to rationalize their null or negative results by alluding to the potential
for negative belief updates (implying that the information treatment ended up leading
agents to update in the opposite direction than had been intended), an argument based on
the intuition that nudges are more likely to be effective when the information about the value
of the action is "good news" relative to the average belief (see, e.g., Schultz et al. (2007)).5
Our model focuses on binary decision problems and shows that with Bayesian updating
and very general assumptions, this standard intuition is wrong. Our key insight is that what
matters for whether the treatment effect of an information nudge is positive or negative
is how the information affects people at the margin. This distinction is important since it
reveals that an information nudge that is "good news" about the value of an action relative to
average beliefs or that is "good news" about the value of an action for the majority of people
need not increase the number of people who take the action. The results suggest that
predicting whether a nudge will be successful at changing behavior requires knowing something
about the distribution of beliefs of agents and being able to identify who is at the margin.
This finding highlights the particular value of measuring the distribution of agent beliefs.
However, even if a researcher is unable to measure beliefs (e.g. because it is prohibitively
costly or otherwise unfeasible), our theory suggests it might still be possible to infer beliefs
at the margin. Embedded in the theory is a useful indicator of beliefs of those at the margin:
the take-up rate in the untreated group. We show that the percentage of people who take-up
the action in the untreated group suggests whether the prior beliefs of people at the margin
are more or less positive than the average belief in the population.6
The model therefore suggests a specific shape of the relationship between the take-up rate
in the untreated group (i.e., the "baseline") and the sign and size of the treatment effect for
a particular nudge. In particular, in settings with different baselines (and with information
nudges that aim to increase beliefs for a majority of the population, typically the goal of
such a study), as the baseline increases from 0: the treatment effect will decrease from 0,
then increase back to 0 (when the information provided in the nudge is identical to the belief
deliver expected outcomes and thus might need more detailed models to help elucidate when we should expect treatment effects. For example, nudges in choice architecture do not always deliver the expected results (see, e.g., Kessler and Roth (2015) on "yes-no" vs. "opt-in" choice frames for active choice organ donation requests of the type made at Departments of Motor Vehicles). See Beshears et al. (2015b) for an analysis of which sub-populations will respond to defaults in 401(k) contribution levels.
5 This argument is often made in settings without good information on beliefs of individuals and so it is hard to validate.
6 To give some intuition for this result, in settings without information on beliefs (e.g. from a survey of the untreated), we assume that when the proportion taking the action in the untreated group is low (high), the marginal agents are those with higher (lower) beliefs about the value of taking the action. In this way, if the proportion choosing the action is low (high) in the untreated group, the treatment is more likely to decrease (increase) beliefs.
of the people at the margin), then continue to increase, and then decrease to 0 as the baseline
approaches 1.
We support our theory, and this pattern between baseline and treatment effects, with a
meta-analysis of informational nudges on binary outcomes in the literature. We find that even
across very different experimental settings, the magnitude of the treatment effect is predicted
by take-up in the untreated group in the way our theory predicts. Consequently, our theory
rationalizes diverse findings from informational interventions present in the literature and
helps to answer the question about why some information nudges find null or negative results.
The meta-analysis, with the current data set (more data is currently being collected), comes
with the caveat that we are underpowered to test predictions about the sign of the treatment
effect.
We develop some extensions of the model, including extending it to environments in which
agents make continuous choices. In those settings, rather than being concerned about people
at the margin, the treatment effect depends fundamentally on the people whose marginal
utility is more sensitive to changes in information. In settings of continuous choice, however,
the standard intuition is closer to accurate: the sign of the treatment effect depends on
whether information is "good news" for the average agent.
Our approach introduces a convenient methodology for dealing with Bayesian updating,
which often has tractability issues. For instance, Chambers and Healy (2012) show that for
any values of $\nu$, $\mu_{\text{prior}}$, and $\mu_{\text{post}}$, there exist a prior distribution $f$ and an unbiased signal
distribution $n$ (i.e. $\int \nu \cdot n(\nu \mid x) \, d\nu = x$) such that upon observing a signal realization $\nu$,
the subject updates her prior expectation from $\mu_{\text{prior}}$ to $\mu_{\text{post}}$. Surprisingly, while the theory
of how to compare the posteriors of two subjects who differ only in the signals they receive
is well known (Milgrom 1981), little has been written about how to compare a subject's
prior to his posterior upon receiving a signal (Featherstone 2015). In this paper, we address
this by formalizing the idea of an information nudge as a signal which has little information
content. In this limit, a simple perturbation theory argument allows us to compute a closed
form expression for the posterior moments in terms of the prior moments. This approach
seems likely to be useful in applied theory more broadly.
While we are focusing on information interventions, other nudges may work partially
through information channels. For example, the specific choice architecture selected by
a policy maker could signal information to decision makers about what the policy maker
thinks is best. In addition, reminders, which are assumed to work through inattention (e.g.,
Taubinsky (2014)) have been shown to affect beliefs about the probability others take a
certain action (see, e.g., Del Carpio (2013)) and so might also work through an information
channel. To the extent that information is active, the main insights of our model would still
apply.
2 A model of information nudges
Our basic model of information nudges is simple. Each agent has a utility function $u(x)$
and a prior $f(x)$, where $x$ is a scalar outcome. Agents "take-up" the "behavior" if they
expect $u(x)$ to be non-negative, and do not take-up otherwise; that is, utility is defined
net of the agent's best alternative. "Nudged" agents receive a common scalar signal of $x$,
denoted by $\nu$, which is distributed according to the density $n(\nu \mid x)$. Agent heterogeneity
across the population in utility and prior will be summarized by a joint distribution $g(\cdots)$
over the moments of the prior ($\mu \equiv E[x]$, $\sigma^2 \equiv \mathrm{Var}[x]$, etc.) and the derivatives of the utility
function, evaluated at the population mean of the prior expectation. Ultimately, we will be
concerned with two quantities aggregated across the population. The first is the fraction of
the population that takes-up without being nudged, which we will call the baseline take-up
rate and denote by $\beta$. The second is the difference between the fraction who take-up when
nudged and the fraction who take-up without being nudged, which we will call the treatment
effect and denote by $\tau$.
The next six subsections flesh out this basic model. In the first, we introduce an intuitive
condition that justifies our model assuming that utility is a function of x alone. In the second
subsection, we introduce a method for approximating the moments of the posterior from the
moments of the prior in the limit where the signal is an information nudge, which we will
formally define in terms of the information content of the signal. In the third subsection, we
derive a condition for whether an agent will take-up in terms of the moments of his prior,
the derivatives of his conditional expected utility, and whether he was nudged. In the fourth
subsection, we will discuss suitable restrictions on how we model the population of agents
with the distribution g. In the fifth subsection, we derive a formula for the treatment effect.
Finally, in the sixth subsection, we discuss the assumptions made in the first five subsections
and discuss the extent to which they are dispensable, as well as what they mean for our
model.
2.1 Interpreting u (x)
Although the information is about x, it is not necessarily the case that x is all that an agent
cares about. In fact, it might even be that an agent doesn't care about x directly, but only
cares about the signal that x sends about other parameters. To capture these intuitions,
consider letting agents have a utility function u (x, y) that is a function of both x as well as
a vector of indirectly signaled arguments $y$. We define the expected utility conditional on $x$
as
$$u(x) \equiv E[u(x, y) \mid x].$$
We usually think of information as changing a subject's beliefs, but not her utility function.
To ensure that $u(x)$ doesn't change when the signal is received, we assume that the signal
is independent of $y$ conditional on the true value of $x$, that is, $\nu \perp y \mid x$.
Proposition 1. If $\nu \perp y \mid x$, then
• To get the marginal posterior over $x$, we can simply apply Bayes' rule directly to the
marginal prior over $x$, that is, $f(x \mid \nu) = \frac{f(x) \cdot n(\nu \mid x)}{\int f(x) \cdot n(\nu \mid x) \, dx}$.
• Expected utility conditional on $x$ and $\nu$ is the same as expected utility conditional on
$x$, that is, $E[u(x, y) \mid x] = E[u(x, y) \mid x, \nu]$.
See appendix for formal proof. Intuitively, this conditional independence assumption
rules out the possibility that the signal conveys independent information about both x and
y. Put differently, ν can convey information about y; it just can't convey information
about y to a subject that already knows x. When we interpret ν as a noisy signal of x,
this is particularly sensible. So, utility being a function of x alone is justified so long as we
interpret it as the expected utility conditional on x.
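Proposition 1 can be checked numerically on a small discrete example. The sketch below is only an illustration: the joint prior over (x, y) and the signal likelihood are made-up numbers, and the function names are hypothetical. Because the signal likelihood depends on x alone, updating the full joint prior and then summing out y should give the same posterior over x as applying Bayes' rule to the marginal prior, and the distribution of y given x should not move when the signal is observed.

```python
# Hypothetical toy numbers: a joint prior over (x, y) and a signal likelihood n(v | x).
# The signal depends on x only, so v is independent of y conditional on x.
prior_xy = {(0, 0): 0.10, (0, 1): 0.30, (1, 0): 0.35, (1, 1): 0.25}
signal_given_x = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}  # signal_given_x[x][v]


def posterior_x_via_joint(v):
    """Update the full (x, y) prior on the signal v, then sum out y."""
    weights = {xy: p * signal_given_x[xy[0]][v] for xy, p in prior_xy.items()}
    total = sum(weights.values())
    posterior = {}
    for (x, _y), w in weights.items():
        posterior[x] = posterior.get(x, 0.0) + w / total
    return posterior


def posterior_x_via_marginal(v):
    """Apply Bayes' rule directly to the marginal prior f(x)."""
    marginal = {}
    for (x, _y), p in prior_xy.items():
        marginal[x] = marginal.get(x, 0.0) + p
    weights = {x: marginal[x] * signal_given_x[x][v] for x in marginal}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


def prob_y1_given_x(x, v=None):
    """P(y = 1 | x), optionally also conditioning on the signal v."""
    like = 1.0 if v is None else signal_given_x[x][v]
    num = prior_xy[(x, 1)] * like
    den = (prior_xy[(x, 0)] + prior_xy[(x, 1)]) * like
    return num / den


for v in (0, 1):
    print(v, posterior_x_via_joint(v), posterior_x_via_marginal(v))
```

The second function never looks at y, which is the practical content of the first bullet; the third function shows why any expectation over y conditional on x is unchanged by the signal, which is the second bullet.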
2.2 How beliefs respond to an information nudge
Bayes' rule dictates that a nudged agent should have a posterior belief given by
$$f(x \mid \nu) = \frac{f(x) \cdot n(\nu \mid x)}{\int f(x) \cdot n(\nu \mid x) \, dx}.$$
Unfortunately, this doesn't tell us much without further restrictions. For instance, Chambers
and Healy (2012) show that for any values of $\nu$, $\mu_{\text{prior}}$, and $\mu_{\text{post}}$, there exist a prior
distribution $f$ and an unbiased signal distribution $n$ (i.e. $\int \nu \cdot n(\nu \mid x) \, d\nu = x$) such that
upon observing a signal realization $\nu$, the subject updates her prior from $\mu_{\text{prior}}$ to $\mu_{\text{post}}$.
Surprisingly, while the theory of how to compare the posteriors of two subjects who differ only
in the signals they receive is well known (Milgrom 1981), little has been written about how
to compare a subject's prior to his posterior upon receiving a signal (Featherstone 2015).
To gain a foothold on this problem, previous literature has tended to make specific
functional form assumptions (cf. Morris and Shin 2002; Angeletos and Pavan 2007; Cornand
and Heinemann 2008). Instead of going this route, we will make two alternative assumptions.
The first formalizes what it means for $\nu$ to be a signal of $x$: given any realization $\nu$, that
realization should be a maximum likelihood estimator of $x$, given a uniform prior. This
maximum likelihood assumption means that $\nu = \arg\max_x n(\nu \mid x)$, which requires that
$n_2(\nu \mid \nu) = 0$ and $n_{22}(\nu \mid \nu) < 0$, where subscripts denote partial derivatives with respect
to the second argument. The second assumption formalizes the definition of a
nudge: it should only induce a slight difference between posterior and prior. Intuitively, this
happens when the signal is only weakly informative, that is, when the distribution of the
signal only weakly changes with $x$.7 Mathematically, consider the asymptotic as $\varepsilon \to 0$ of
a family of signal distributions based on the original: $n_{\varepsilon,\nu_0}(\nu \mid x) \equiv n(\nu \mid \nu_0 + \varepsilon \cdot (x - \nu_0))$.
The parameter $\nu_0$ serves as a way to keep variation in the signal distinct from variation in
the center about which a Taylor expansion will ultimately be made. Clearly, if we take the
$\varepsilon \to 0$ limit and then substitute $\nu_0 = \nu$, the posterior approaches the prior, since $n_{0,\nu}(\nu \mid x)$
remains constant as $x$ varies. The information nudge approximation is to compute the
$\varepsilon \to 0$ asymptotic, substitute $\nu_0 = \nu$, and then truncate to leading order in $\varepsilon$. Using this
approach, the signal distribution becomes8
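$$
\begin{aligned}
\left[ n_{\varepsilon,\nu_0}(\nu \mid x) \right]_{\varepsilon \to 0} \Big|_{\nu_0 = \nu}
&\approx \left[ n(\nu \mid \nu_0) + n_2(\nu \mid \nu_0) \cdot (x - \nu_0) \cdot \varepsilon + \frac{1}{2} \cdot n_{22}(\nu \mid \nu_0) \cdot (x - \nu_0)^2 \cdot \varepsilon^2 \right]_{\nu_0 = \nu}, \\
&= n(\nu \mid \nu) \cdot \left\{ 1 + \frac{1}{2} \cdot \frac{n_{22}(\nu \mid \nu)}{n(\nu \mid \nu)} \cdot (x - \nu)^2 \cdot \varepsilon^2 \right\}.
\end{aligned}
$$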
Further, if we re-scale $\varepsilon$ by a factor of $\sqrt{-\frac{n_{22}(\nu \mid \nu)}{n(\nu \mid \nu)}}$, the expression in the curly brackets
becomes proportional to $1 - \frac{1}{2} \cdot \varepsilon^2 \cdot (x - \nu)^2$, which is exactly what we would have gotten
had we started by assuming that $n(\nu \mid x)$ was the normal density with mean $x$ and very
large variance. In other words, to leading order, nudges look normal. Going forward, we
7 Although we base our formalization of nudge on the latter intuition, it also makes the former more precise: if we use the nudge approximation, straightforward computation shows that the Fisher information of the signal becomes small, that is, $I(x) \equiv E\left[ \left\{ \frac{\partial}{\partial x} \log n(\nu \mid x) \right\}^2 \,\middle|\, x \right] \sim O(\varepsilon^4)$.
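8 Note that the integral of this signal distribution with respect to $\nu$ is not equal to one. This demonstrates a subtlety to our approach: the $\nu_0 = \nu$ substitution should not be taken until after any operation that treats $\nu$ as a variable independent of $\nu_0$. For instance, instead of simply $\int n(\nu \mid \nu) \cdot \left\{ 1 + \frac{1}{2} \cdot \frac{n_{22}(\nu \mid \nu)}{n(\nu \mid \nu)} \cdot (x - \nu)^2 \cdot \varepsilon^2 \right\} \cdot d\nu$, the integral of the signal distribution should first be computed leaving $\nu_0$ in place, since the integral is over $\nu$ and not $\nu_0$:
$$
\begin{aligned}
&\int \left[ n(\nu \mid \nu_0) + n_2(\nu \mid \nu_0) \cdot (x - \nu_0) \cdot \varepsilon + \frac{1}{2} \cdot n_{22}(\nu \mid \nu_0) \cdot (x - \nu_0)^2 \cdot \varepsilon^2 \right] \cdot d\nu \\
&\quad = 1 + \int n_2(\nu \mid \nu_0) \cdot d\nu \cdot (x - \nu_0) \cdot \varepsilon + \int n_{22}(\nu \mid \nu_0) \cdot d\nu \cdot \frac{1}{2} \cdot (x - \nu_0)^2 \cdot \varepsilon^2 \\
&\quad = 1 + \frac{\partial}{\partial \nu_0} \left[ \int n(\nu \mid \nu_0) \cdot d\nu \right] \cdot (x - \nu_0) \cdot \varepsilon + \frac{\partial^2}{\partial \nu_0^2} \left[ \int n(\nu \mid \nu_0) \cdot d\nu \right] \cdot \frac{1}{2} \cdot (x - \nu_0)^2 \cdot \varepsilon^2 = 1.
\end{aligned}
$$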
will re-scale $\varepsilon$ so that $n(\nu \mid x) \propto 1 - \frac{1}{2} \cdot \varepsilon^2 \cdot (x - \nu)^2$.9 This approximation will allow us to
make significant headway against the intractability of the general Bayesian update formula.
Specifically, now we can calculate any moment of the posterior. Let $\mu \equiv E[x]$ and
$\sigma^2 \equiv E[(x - \mu)^2]$ represent the mean and variance of the prior, and further let
$\gamma_1 \equiv \frac{E[(x - \mu)^3]}{\sigma^3}$ and $\gamma_2 \equiv \frac{E[(x - \mu)^4]}{\sigma^4} - 3$ represent the prior's skewness and excess kurtosis.10 Then,
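Theorem 1. The first two moments of the posterior, to leading order in $\varepsilon^2$, are
$$E[x \mid \nu] = \mu + \sigma^2 \cdot \left\{ (\nu - \mu) - \frac{1}{2} \cdot \sigma \cdot \gamma_1 \right\} \cdot \varepsilon^2,$$
$$\mathrm{Var}[x \mid \nu] = \sigma^2 - \left\{ \sigma^4 \cdot \left( 1 + \frac{1}{2} \cdot \gamma_2 \right) - \sigma^3 \cdot \gamma_1 \cdot (\nu - \mu) \right\} \cdot \varepsilon^2.$$
If $\gamma_1$ and $\gamma_2$ are $O(\varepsilon^2)$ small, then these moments reduce to
$$E[x \mid \nu] = \mu + \sigma^2 \cdot (\nu - \mu) \cdot \varepsilon^2,$$
$$\mathrm{Var}[x \mid \nu] = \sigma^2 - \sigma^4 \cdot \varepsilon^2.$$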
Theorem 11 in the appendix provides a general formula for any moment of the posterior,
but for the purposes of this paper, the first two moments will suffice. For the posterior
expectation, the intuition behind the first term inside the curly brackets is obvious, but the
intuition for the second is more subtle. Looking to the posterior variance, we see that an
informative signal tends to decrease it from the prior variance. If the prior is asymmetric,
this decrease in variance will disproportionately remove weight from the side the prior skews
towards, which will in turn push the posterior expectation away from that skew. For the
posterior variance, the first term in curly brackets is due to the fact that an informative signal
"reassures" subjects. The excess kurtosis term shows that subjects are more reassured when
their prior has more tail risk.11 The skewness term serves as a correction for asymmetry in
tail risk. To see this, consider the case where ν > µ. If γ1 > 0, then the signal pushes the
prior towards the side on which it has the fatter tail. As such, the posterior variance doesn't
decrease as much. If γ1 < 0, then the signal pushes the prior towards the thinner tail, and
the posterior variance decreases even more.
Although it is interesting to note how asymmetry and tail risk affect even the first two
9 Note that different subjects will have different re-scaling if they have different beliefs about the magnitude of $\frac{n_{22}(\nu \mid \nu)}{n(\nu \mid \nu)}$. We will ignore this potential heterogeneity going forward, but it could be easily incorporated into the theory we present.
10 The term "excess kurtosis" refers to the fact that a normal distribution has $\frac{E[(x - \mu)^4]}{\sigma^4} = 3$.
11 Note that they are, in fact, reassured, since Rohatgi and Székely (1989) prove that the inequality $1 + \frac{1}{2} \cdot \gamma_2 \ge 0$ holds generally.
moments of the posterior, for many contexts, we might think that those concerns are unimportant,
either due to the underlying distributions, or due to bounded rationality. The
intuition behind considering the case where $\gamma_1$ and $\gamma_2$ are $O(\varepsilon^2)$ small is that, for a normal
distribution, these quantities are both zero. For a more general symmetric distribution, $\gamma_1$ is
zero. So, we will call the prior near-normal if its third and higher central moments are close
to what they would be if the distribution were normal with the same mean and variance,
and we will call the prior near-symmetric if its odd central moments are close to zero.12
2.3 How the individual responds to an information nudge
The previous subsection outlined a method for computing the moments of the posterior given
the moments of the prior. Of course, we are concerned not just with these moments, but
how they affect the behavior of our subjects. Fortunately, we can find this effect by taking
the Taylor series of u (x) centered at the average µ across the population, E [µ]:
$$u(x) = u'(E[\mu]) \cdot \left\{ -\theta + x - \alpha_1 \cdot \left[ \frac{1}{2} \cdot (x - E[\mu])^2 + \sum_{n=3}^{\infty} \frac{(-1)^n}{n!} \cdot \prod_{j=2}^{n-1} \alpha_j \cdot (x - E[\mu])^n \right] \right\},$$
where $\theta \equiv E[\mu] - \frac{u(E[\mu])}{u'(E[\mu])}$, and the $\alpha_j \equiv -\frac{u^{(j+1)}(E[\mu])}{u^{(j)}(E[\mu])}$ are the various ordered coefficients of
absolute risk aversion, e.g. $\alpha_1$ is the coefficient of absolute risk aversion, $\alpha_2$ is the coefficient
of absolute prudence, $\alpha_3$ the coefficient of absolute temperance, etc. (Kimball 1990, 1992;
Denuit and Eeckhoudt 2010). Again, although risk aversion and its higher-order counterparts
might be interesting in some situations, we think the most interesting applications of our
theory will be when risk aversion can be ignored. The near risk-neutral approximation
is that the $\alpha_j$ are $O(\varepsilon^2)$ small, so that to leading order, the utility is quadratic.
Going forward, we will refer to $\theta$ as the threshold, since the part of utility that is
a function of beliefs must be above $\theta$ for an agent to take-up.13 This also means that $\theta$
represents the best outside option that an agent has. In the near risk-neutral approximation,
to leading order, an untreated agent will take-up when $\mu - \frac{1}{2} \cdot \alpha_1 \cdot E[(x - E[\mu])^2] \ge \theta$, while
a treated agent will take-up when $\mu - \frac{1}{2} \cdot \alpha_1 \cdot E[(x - E[\mu])^2 \mid \nu] + \sigma^2 \cdot (\nu - \mu) \cdot \varepsilon^2 \ge \theta$. Hence,
to leading order, an agent with prior expectation $\mu$ is persuaded to take-up whenever $\theta$ is
less than $\sigma^2 \cdot (\nu - \mu) \cdot \varepsilon^2$ above $\mu - \frac{1}{2} \cdot \alpha_1 \cdot E[(x - E[\mu])^2]$.14 This will prove useful when we
12 Recall that the $n$th central moment is $E[(x - \mu)^n]$, where $\mu = E[x]$. Note that the first central moment is zero, by definition.
13 Note that this assumes that $u'(E[\mu]) > 0$, which is basically equivalent to assuming that higher $x$ is "good news" about taking-up.
14 The width of the range does not include $\frac{1}{2} \cdot \alpha_1 \cdot \left( E[(x - E[\mu])^2 \mid \nu] - E[(x - E[\mu])^2] \right)$ because that term is $O(\alpha_1 \cdot \varepsilon^2)$. This is because the fact that $f(x \mid \nu) = f(x) + O(\varepsilon^2)$ implies that the difference between
aggregate the number of agents persuaded by the nudge across the entire population.
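The individual decision rule described in this subsection can be written as a short sketch. It assumes the near risk-neutral case taken to its limit, with the risk-aversion term dropped entirely (α₁ = 0), and uses the near-symmetric form of the posterior expectation; the function names and the numbers in the example are hypothetical.

```python
def takes_up(mu, sigma2, theta, nu=None, eps=0.0):
    """Leading-order take-up rule with the risk-aversion term dropped (alpha_1 = 0).

    Untreated agents (nu is None) compare their prior expectation mu with the
    threshold theta; treated agents first shift mu by sigma2 * (nu - mu) * eps**2.
    """
    expectation = mu if nu is None else mu + sigma2 * (nu - mu) * eps ** 2
    return expectation >= theta


def persuaded(mu, sigma2, theta, nu, eps):
    """True when the agent takes up only if nudged."""
    return (not takes_up(mu, sigma2, theta)) and takes_up(mu, sigma2, theta, nu, eps)


# Hypothetical agent: prior mean 1.0, prior variance 4.0, signal 2.0, eps 0.1,
# so the nudge raises the expectation by 4.0 * (2.0 - 1.0) * 0.01 = 0.04.
for theta in (0.90, 1.03, 1.05):
    print(theta, takes_up(1.0, 4.0, theta), persuaded(1.0, 4.0, theta, nu=2.0, eps=0.1))
```

Only the middle threshold lies inside the narrow band just above the prior expectation, which is why only agents close to the margin change their behavior.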
2.4 Modeling a population
So far, we have discussed how information nudges affect the decision-making of the individual.
Now, we want to aggregate those changes in behavior across an entire population of agents.
Essentially, this requires assuming some joint distribution g over prior moments, i.e. µ, σ2,
etc., and utility parameters, i.e. θ, α1, etc. Depending on what approximations are made
along the way, some of these parameters will not enter into the model, and as such, can
simply be marginalized out. For our purposes, we will only need a distribution over θ, µ,
and σ2.
Before moving on, however, we think a few restrictions on g help to focus the discussion.
First, if all agents draw their priors from the same distribution (i.e. the same random process
of information acquisition), then we might expect that agents' beliefs are independent of their
outside options, that is, θ is independent of (µ, σ2). We call this the common information
acquisition assumption.
We will also assume that µ is independent of σ2. To motivate this, note that the
population-averaged posterior expectation is given by
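$$E\left[ \mu + \sigma^2 \cdot (\nu - \mu) \cdot \varepsilon^2 \right] = E[\mu] + \varepsilon^2 \cdot \left\{ E[\sigma^2] \cdot (\nu - E[\mu]) - \mathrm{Cov}(\sigma^2, \mu) \right\}.$$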
We might think that, on average, agents should update towards the signal, that is, if ν is
below the population-averaged prior, then the posterior should be below the prior, and if ν
is above the prior, then the posterior should be above the prior. Looking to the formula, if µ
and $\sigma^2$ are sufficiently positively correlated, then even when the signal is above the
population-averaged prior expectation, we might see the population-averaged posterior expectation
decrease from the prior. Intuitively, if those with prior expectations above the signal are
sufficiently more swayed than those with prior expectations below the signal, then the net effect
of the signal will be to decrease the population-averaged posterior expectation, regardless of
the signal. Keeping µ independent of σ2 prevents this by ensuring that Cov (σ2, µ) = 0. We
call this assumption the update-towards-the-signal assumption.
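The role of the covariance term can be seen in a two-agent example. The sketch below uses made-up populations and hypothetical function names; it computes the average shift in expectations directly and through the decomposition above, once for a population where optimists are far less confident than pessimists and once where confidence is unrelated to optimism.

```python
def average_posterior_shift(population, nu, eps):
    """Population mean of sigma2 * (nu - mu) * eps^2 over (mu, sigma2) pairs."""
    return sum(s2 * (nu - mu) for mu, s2 in population) / len(population) * eps ** 2


def decomposition(population, nu, eps):
    """eps^2 * { E[sigma2] * (nu - E[mu]) - Cov(sigma2, mu) }."""
    n = len(population)
    mean_mu = sum(mu for mu, _ in population) / n
    mean_s2 = sum(s2 for _, s2 in population) / n
    cov = sum((mu - mean_mu) * (s2 - mean_s2) for mu, s2 in population) / n
    return eps ** 2 * (mean_s2 * (nu - mean_mu) - cov)


# Hypothetical populations of (mu, sigma2) pairs, both with E[mu] = 0 and E[sigma2] = 2.
correlated = [(-1.0, 0.5), (1.0, 3.5)]   # optimists hold much more diffuse priors
independent = [(-1.0, 2.0), (1.0, 2.0)]  # same prior variance for everyone

nu, eps = 0.5, 0.1  # the signal is above the population-averaged prior
print(average_posterior_shift(correlated, nu, eps), decomposition(correlated, nu, eps))
print(average_posterior_shift(independent, nu, eps), decomposition(independent, nu, eps))
```

In the correlated population the average expectation falls even though the signal exceeds the average prior, which is the case the update-towards-the-signal assumption rules out.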
Together, our two assumptions dictate that the population distribution of agent parameters
can be written as $g(\theta, \mu, \sigma^2) = t(\theta) \cdot m(\mu) \cdot s(\sigma^2)$, that is, as the product of independent
threshold, expectation, and confidence distributions. Although this may seem extreme, the
results we will derive are not knife's-edge-reliant on these independence assumptions; in fact,
any posterior and prior expectation must be $O(\varepsilon^2)$.
both could be replaced by the assumption that $G(\theta, \mu, \sigma^2) - T(\theta) \cdot M(\mu) \cdot S(\sigma^2)$ is $O(\varepsilon^2)$
small, which we call the near independence assumption.
2.5 How the population responds to an information nudge
Among agents whose prior mean and variance are $\mu$ and $\sigma^2$, the fraction persuaded by the
treatment should be the product of the threshold density evaluated at $\mu$, i.e. $t(\mu)$, and the
width of the range of thresholds that are persuaded, which is, to leading order, $\sigma^2 \cdot (\nu - \mu) \cdot \varepsilon^2$.15
Now, to get the fraction persuaded across the entire population, i.e. the treatment effect,
we simply take the expectation of our product over the population distribution of $(\mu, \sigma^2)$.
Theorem 2. To leading order, under the assumptions made thus far (and summarized in
the next section), the treatment effect is equal to either of the following expressions:
$$\tau = \varepsilon^2 \cdot E[\sigma^2] \cdot E[t(\mu)] \cdot (\nu - E[\mu \mid \theta = \mu]),$$
$$\phantom{\tau} = \varepsilon^2 \cdot E[\sigma^2] \cdot E[m(\theta)] \cdot (\nu - E[\theta \mid \theta = \mu]).$$
The second expression comes from the fact that $\theta \in (\mu, \mu + \sigma^2 \cdot (\nu - \mu) \cdot \varepsilon^2)$ is (to first order
in $\varepsilon^2$) equivalent to $\mu \in (\theta - \sigma^2 \cdot (\nu - \theta) \cdot \varepsilon^2, \theta)$. Conditioning on $\theta = \mu$ will henceforth
be called conditioning on the marginal agent, as agents with $\theta = \mu$ are exactly the ones
convinced by nudges. Hence, we have shown that the treatment effect's sign is determined
by whether the signal is above or below the average prior of the marginal agent, and its
magnitude is a function of five things: the nudge realization ($\nu$), the information content of
the nudge ($\varepsilon^2$), how many marginal agents there are ($E[t(\mu)]$), how easily persuaded agents
are on average ($E[\sigma^2]$), and how far the signal is from the average marginal agent's belief
($\nu - E[\mu \mid \theta = \mu]$).
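Theorem 2 can be illustrated in a case where everything is available in closed form. The sketch below is an example under assumptions that are not in the text: prior expectations are distributed N(0, 1), thresholds are distributed N(theta0, 1) independently of them, every agent has the same prior variance, and `k` stands for the product σ² · ε². Under these assumptions the exact change in take-up is a difference of two normal probabilities, the density of marginal agents is E[t(µ)] = φ(theta0/√2)/√2, and the average belief of the marginal agent is theta0/2. All function names are hypothetical.

```python
import math


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def exact_treatment_effect(theta0, nu, k):
    """Take-up when nudged minus baseline take-up, with k = sigma^2 * eps^2."""
    # Nudged agents take up when theta - (1 - k) * mu <= k * nu, and the left
    # side is normal with mean theta0 and variance 1 + (1 - k)^2.
    treated = normal_cdf((k * nu - theta0) / math.sqrt(1.0 + (1.0 - k) ** 2))
    baseline = normal_cdf(-theta0 / math.sqrt(2.0))
    return treated - baseline


def theorem2_treatment_effect(theta0, nu, k):
    """Leading-order effect: k * E[t(mu)] * (nu - E[mu | theta = mu])."""
    density_at_margin = normal_pdf(theta0 / math.sqrt(2.0)) / math.sqrt(2.0)
    marginal_belief = theta0 / 2.0
    return k * density_at_margin * (nu - marginal_belief)


theta0, k = 1.0, 1e-3
for nu in (2.0, 0.3):
    print(nu, exact_treatment_effect(theta0, nu, k), theorem2_treatment_effect(theta0, nu, k))
```

With `nu = 0.3` the signal is above the population-averaged prior of 0 but below the marginal agent's average belief of 0.5, so the effect is negative, which is the distinction between the average and the marginal agent that the theorem draws.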
2.6 Discussion of assumptions
In explaining our approach to modeling information nudges, we have introduced assumptions
as they are needed. For convenience, we list them here. The following assumptions are
indispensable to our approach.
Conditional independence: Conditional on the true value of the information being signaled,
the signal provides no additional information about other arguments of agents'
utility functions. Without this assumption, we cannot think only in terms of the
quantity that is signaled.
15 For a reminder of why this is so, see Section 2.3, and specifically Footnote 14.
Maximum likelihood: Given one observation of the signal and a uniform prior, that observation
is a maximum likelihood estimator of the underlying x. This assumption is
relatively innocuous: it mainly serves to rule out signals that need to be de-biased, like
ν = x + 3 with probability 1.
Information nudge: The signal distribution changes very slowly in the true value of the
information being signaled. Specifically, we assume that the signal distribution conditional
on the true value is the leading order asymptotic in $\varepsilon$ of $n(\nu \mid \nu_0 + \varepsilon \cdot (x - \nu_0))$,
evaluated at $\nu_0 = \nu$. Without this assumption, we cannot conclude much of anything
about how the posterior relates to the prior.
The rest of our assumptions are made to help the model point at what we think of as the
most intuitive application of the framework. These assumptions can be replaced with others,
and in fact, we will explore a few such alternative models in Section 6 of this paper.
Near symmetric: The skewness of the prior is $O(\varepsilon^2)$ small. This allows us to ignore
the possibility that the signal "reassures" agents about upside and downside tail risk
asymmetrically.
Near risk-neutral: The coefficient of absolute risk aversion, as well as the higher-order
coefficients of absolute risk aversion are $O(\varepsilon^2)$ small. This allows us to abstract away
from the fact that a signal serves to "reassure" in addition to changing expectations.
Near independent: The difference between the joint distribution of $(\theta, \mu, \sigma^2)$ and the product
of that distribution's marginals is $O(\varepsilon^2)$ small. This assumption is a weaker form
of two more intuitively motivated assumptions. The first is a common information
acquisition process: across the population, the threshold $\theta$ is independent of the belief
parameters $(\mu, \sigma^2)$. This allows us to ignore the issues that would arise if agents with
higher beliefs also had better outside options. The second is the update towards the
signal assumption: across the population, $\sigma^2$ is independent from $\mu$. This allows us to
ignore the issues that would arise if the malleability of an agent's beliefs were strongly
correlated with how optimistic they are about the value of x.
2.7 Indirect information nudges <not polished, but the ideas are
correct>
Our theory thus far has interpreted the nudge signal ν as being on the same scale as the
quantity it provides information about, x. Intuitively, this means that if x is something like
the post-college salary an agent will get if she majors in economics, then ν can be naturally
interpreted as a salary, and given a uniform prior, is her best guess at x. Although this is
the most straightforward way to think about information nudges, it is possible to interpret
them more broadly.
For instance, consider our agent who is considering an economics major. She cares about
how much money she will make after college, but doesn't know what that will be. Say she
attends an information session about the major at which she meets 80% of the graduates
from the past 5 years. It is rude to ask the alumni she meets at the session about their
salaries, but she can observe the fraction that are wearing expensive suits. In the most
straightforward application of our model, x is the fraction of alums who can afford expensive
suits, and ν is the fraction she observes at the information session. Clearly, observing the
suits of 80% of the universe of economics alums is quite informative about the overall fraction
of economics alums with expensive suits. As such, a straightforward application of the model
is inappropriate. Still, we might think that strong information about the suits that economics
alums wear provides relatively little information about the average salary of an alum. Of
course, how to map the fraction with nice suits into a guess at the average salary is not
obvious. A bit of math will clarify.
Denote the average income of alums by z, the fraction that wear nice suits by x, and our
agent's joint prior by f (x, z). First, we show that a highly informative signal about x is a
weakly informative signal about z if x and z are not strongly dependent. By the same logic
used in Section 2.2, we model this weak dependence by expanding f (x | z) as
\[
f(x \mid z) \approx f(x \mid z_0) + f_2(x \mid z_0) \cdot (z - z_0) \cdot \varepsilon + \tfrac{1}{2} \cdot f_{22}(x \mid z_0) \cdot (z - z_0)^2 \cdot \varepsilon^2,
\]
where z0 has yet to be determined. From there, we can derive the likelihood of signal ν
conditional on z, n (ν | z), from the likelihood of signal ν conditional on x, n (ν |x):
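\[
\begin{aligned}
n(\nu \mid z) &= \int n(\nu \mid x) \cdot f(x \mid z) \, dx, \\
&\approx \int n(\nu \mid x) \cdot \left[ f(x \mid z_0) + f_2(x \mid z_0) \cdot (z - z_0) \cdot \varepsilon + \tfrac{1}{2} \cdot f_{22}(x \mid z_0) \cdot (z - z_0)^2 \cdot \varepsilon^2 \right] dx, \\
&\approx n(\nu \mid z_0) + n_2(\nu \mid z_0) \cdot (z - z_0) \cdot \varepsilon + \tfrac{1}{2} \cdot n_{22}(\nu \mid z_0) \cdot (z - z_0)^2 \cdot \varepsilon^2.
\end{aligned}
\]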
Now, define the maximum likelihood estimator of z given signal ν as
\[
z(\nu) \equiv \arg\max_{z} \, n(\nu \mid z)
\]
so that n2 (ν | z (ν)) = 0 and n22 (ν | z (ν)) < 0. Then, if we choose z0 = z (ν), our
approximation of the signal distribution conditional on z can be expressed as
\[
\begin{aligned}
n(\nu \mid z) &\approx n(\nu \mid z(\nu)) + \tfrac{1}{2} \cdot n_{22}(\nu \mid z(\nu)) \cdot (z - z(\nu))^2 \cdot \varepsilon^2, \\
&= n(\nu \mid z(\nu)) \cdot \left\{ 1 - \tfrac{1}{2} \cdot (z - z(\nu))^2 \cdot \varepsilon^2 \right\},
\end{aligned}
\]
where $\varepsilon \equiv \sqrt{-n_{22}(\nu \mid z(\nu)) \, / \, n(\nu \mid z(\nu))}$. This is essentially the same expansion we assumed in Section 2.2,
except now, the signal must be “interpreted” through the maximum likelihood estimator,
z (ν). Hence, the moment-update formulas of Theorem 1 remain the same, except that all
prior moments are relative to the prior belief about z, and ν is replaced with z (ν).
Continuing to parallel our earlier derivation, we consider how the logic of Section 2.3
proceeds under our alternative assumptions. Here, we will need to model our agent caring
about whether alums wear nice suits only because this fraction provides information about
alums' salaries.16 This means variation in x does not affect the full utility to leading order,
that is u (x, z, y) = u (x′, z, y) +O (ε) for all x, x′. If this is the case, then we can essentially
think of our agent as responding to a weak signal about z and completely ignore x.
Finally, we look at how the logic of Sections 2.4 and 2.5 proceeds under our alternative
assumptions. The main wrinkle here is that z (ν) could potentially vary from agent to
agent. One way to deal with this is to assume that all agents will agree on how the direct
signal about x, ν, can be interpreted as a direct signal about z, z (ν). If this is the case,
then all results in this paper carry through replacing ν with z (ν). Alternatively, one could
assume heterogeneity in the interpretation, but in a way that is independent from the other
parameters of the model, in the same way that we assume that µ is independent of σ2. The
net result of this approach would be to replace ν in the theory with the population-averaged
interpretation of ν, E [z (ν)].
So, in Sections 2.2 through 2.5 we discussed the situation when a signal is weakly in-
formative and naturally interpreted. In contrast, this section has discussed the situation
when a signal provides strong information about something that agents do not care directly
about, but that weakly correlates with something they do care about. Both of these situ-
ations can be analyzed with our model of information nudges. The situation in which our
model does not apply is when the signal provides strong information about something agents
care directly about. To make the distinctions more concrete, consider a two-player public
goods game in which players are motivated by some mixture of self-interest and reciprocity.
If Player 1 were told the average contribution of all Player 2s across a large experiment
16We are ruling out “directly” caring about whether alums wear nice suits, that is, if our agent knew all other pertinent information about alums, varying the fraction that wear nice suits would not affect her utility.
(not just the one to which he is matched), the theory described in Sections 2.2 through 2.5
would apply directly. If Player 1 were told Player 2's gender, then the theory described in
this section would apply, assuming that gender correlates weakly with public goods giving,
and Player 1's reciprocity has no gender-specific component, such as “chivalry”. Finally, if
Player 1 were told Player 2's exact contribution, as in Potters, Sefton and Vesterlund (2005,
2007), our theory would not apply at all.
3 Comparative statics
Theorem 2 can help us to understand what makes the treatment effect bigger or smaller.
Looking to the primitives of our model, there are five main things that can change: the
information content of the nudge, the signal realization, the average confidence that agents
have in their beliefs, the distribution of the expectation of those beliefs, and the distribution
of outside options. In this section, we will explore how each of these affects the treatment
effect.
3.1 Signal, confidence, and strength of nudge
Looking to Theorem 2, it is simple to derive the following elasticities:
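\[
\frac{\partial \tau}{\partial \varepsilon^2} \cdot \frac{\varepsilon^2}{\tau} = 1, \qquad
\frac{\partial \tau}{\partial E[\sigma^2]} \cdot \frac{E[\sigma^2]}{\tau} = 1.
\]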
Both are relatively intuitive. Increasing the parameters ε² or E[σ²] by some factor simply
increases the magnitude of the treatment effect, be it positive or negative, by that same factor.
This makes sense, as neither of these parameters changes the direction in which agents
update, but rather how much they update. Further, it makes sense that these parameters
increase the magnitude of the treatment effect: ε² represents the information content of the
nudge, and E[σ²] represents how unsure agents are about their prior expectation. The other
immediate comparative static is
\[
\frac{\partial \tau}{\partial \nu} = \varepsilon^2 \cdot E[t(\mu)] \cdot E[\sigma^2].
\]
It is intuitive that ∂τ/∂ν is always positive, as higher signals should always make taking up more
attractive. Further, the effect of an increase in ν depends on how many agents are marginal
and how confident agents are in their prior expectations. Less obviously, the comparative
static implies that there are no diminishing returns to increasing ν, since the derivative does
not itself depend on ν.
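The elasticities and the constant slope in ν can be checked numerically. The following is a minimal sketch, assuming that Theorem 2 has the multiplicative form implied by the derivatives in this subsection, τ = ε² · E[σ²] · E[t(µ)] · (ν − E[µ | θ = µ]). The function names and all parameter values are hypothetical and chosen only for illustration.

```python
def treatment_effect(eps2, mean_sigma2, marginal_density, nu, marginal_belief):
    """tau = eps^2 * E[sigma^2] * E[t(mu)] * (nu - E[mu | theta = mu]).

    This multiplicative form is an assumption inferred from the derivatives
    quoted in the text; Theorem 2 itself is not reproduced here.
    """
    return eps2 * mean_sigma2 * marginal_density * (nu - marginal_belief)


def elasticity(f, x, h=1e-6):
    """Numerical elasticity (df/dx) * (x / f) using a central difference."""
    return (f(x * (1 + h)) - f(x * (1 - h))) / (2 * h * f(x))


# Hypothetical parameter values, chosen only for illustration.
base = dict(eps2=0.04, mean_sigma2=2.0, marginal_density=0.3,
            nu=1.5, marginal_belief=1.0)

eps2_elasticity = elasticity(
    lambda v: treatment_effect(**{**base, "eps2": v}), base["eps2"])
sigma2_elasticity = elasticity(
    lambda v: treatment_effect(**{**base, "mean_sigma2": v}),
    base["mean_sigma2"])


def slope_in_nu(nu, h=1e-6):
    """Central-difference estimate of d tau / d nu at a given nu."""
    up = treatment_effect(**{**base, "nu": nu + h})
    down = treatment_effect(**{**base, "nu": nu - h})
    return (up - down) / (2 * h)
```

Under this assumed form, both elasticities come out as one, and `slope_in_nu` returns the same value at every ν, which is the "no diminishing returns" property.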
3.2 Thresholds and beliefs
We can model shifts in thresholds and prior expectations by assuming that there exist
parameters λt and λm that shift the likelihood ratios of those distributions. This creates families
ordered in the monotone likelihood ratio (MLR) sense, which we denote by t (θ |λt) and
m (µ |λm).17 We also want our likelihood ratio shifters to move their distributions
arbitrarily high or low, so we set their domain to (−∞,∞) and assume, for any µ or θ, that
\[
\lim_{\lambda_t \to -\infty} T(\theta \mid \lambda_t) = \lim_{\lambda_m \to -\infty} M(\mu \mid \lambda_m) = 1
\quad \text{and} \quad
\lim_{\lambda_t \to \infty} T(\theta \mid \lambda_t) = \lim_{\lambda_m \to \infty} M(\mu \mid \lambda_m) = 0.
\]
These properties will allow us to more easily derive the comparative statics of increasing thresholds
and beliefs.18
3.2.1 Increasing thresholds
Everything we have discussed thus far is invariant to an increasing transformation of λt.
As such, for this subsection, we will define our likelihood ratio shifter such that λt =
E [µ | θ = µ; λt]. Looking to Theorem 2, this means that τ (λt) ∝ E [t (µ |λt)] · (ν − λt),
where the constant of proportionality consists of terms that don't vary in λt. Given this
setup, we can already say a lot about the general shape of the function. Specifically,
Proposition 2. The following are properties of τ (λt):
1. $\lim_{\lambda_t \to -\infty} \tau(\lambda_t) = \lim_{\lambda_t \to \infty} \tau(\lambda_t) = 0$.
2. When λt < ν, τ (λt) > 0, and when λt > ν, τ (λt) < 0.
3. If both m (µ) and t (θ |λt) are log-concave, then the magnitude of the treatment effect
has exactly two peaks: one in the negative treatment effect range and one in the positive
treatment effect range.
See the appendix for a full proof of the proposition. Intuitively, Part 1 holds because as λt
approaches ±∞, the main mass of thresholds is so far from the main mass of beliefs that
there is a vanishing fraction of the population that is marginal. Part 2 is obvious, since
sgn {τ (λt)} = sgn {ν − λt}. Part 3 makes use of a common restriction on distributions: that
17A family m (µ |λm) is ordered in the MLR sense if, for any λ′m > λm and µ′ > µ, $\frac{m(\mu' \mid \lambda'_m)}{m(\mu \mid \lambda'_m)} > \frac{m(\mu' \mid \lambda_m)}{m(\mu \mid \lambda_m)}$. This property is commonly used in conjunction with modeling Bayesian updating. Important for our purposes is that it implies first-order stochastic dominance, that is, $M(\mu \mid \lambda'_m) < M(\mu \mid \lambda_m)$.
18Technically, these assumptions allow us to use the shifters to move the mean prior expectation of marginal agents across the entire range (−∞,∞). See Lemma 6 in the appendix for a formal proof.
its natural logarithm is concave. Many commonly used distributions have this property,
including the uniform, normal, logistic, extreme value, and Laplace distributions (Bagnoli
and Bergstrom 2005). Essentially, log-concavity acts as a strong form of single-peakedness
for positive functions.19 Another good reason to think that log-concavity fits well into our
framework is the fact that translating a distribution to the right increases it in the monotone
likelihood ratio sense if and only if it is log-concave.20 Once we have log-concavity, the
single-peakedness of the treatment effect on its positive and negative ranges is a straightforward
application of the fact that integrating an argument out of a log-concave function leaves
behind a log-concave function (Prékopa 1973).
3.2.2 Increasing beliefs
Now, we consider what happens when we shift λm, holding λt constant. As before, we define
our shifter such that λm = E [µ | θ = µ; λm], leaving us with τ (λm) ∝ E [m (θ |λm)] · (ν − λm).
By logic similar to that of the previous section, we can characterize the shape of τ (λm).
Specifically,
Proposition 3. The following are properties of τ (λm):
1. $\lim_{\lambda_m \to -\infty} \tau(\lambda_m) = \lim_{\lambda_m \to \infty} \tau(\lambda_m) = 0$.
2. When λm < ν, τ (λm) > 0, and when λm > ν, τ (λm) < 0.
3. If both m (µ |λm) and t (θ) are log-concave, then the magnitude of the treatment effect
has exactly two peaks: one in the negative treatment effect range and one in the positive
treatment effect range.
So, the treatment effect as a function of the shifters looks the same for both: it starts at zero,
increases to a peak, then decreases to zero, then decreases to a trough, and then increases
to zero again.
3.3 Proxying for shifters using the baseline take-up rate
Although the comparative statics of Subsections 3.2.1 and 3.2.2 are interesting, it would
be difficult to take them to the data, as the shifters correspond to the expected threshold
19To see that log-concavity implies single-peakedness, note that $\partial^2 \log f(x) / \partial x^2 < 0$ and f > 0 imply that $f'' < (f')^2 / f$. At an optimum, f′ = 0, which means that f′′ < 0, the second-order condition for a local maximum. And if all local optima are maxima, then there can only be one local optimum. To see how log-concavity is a strengthening of single-peakedness, note that Ibragimov (1956) shows that a function is log-concave if and only if, when convolved with another single-peaked function, the result is single-peaked as well.
20This is a straightforward consequence of the fact that $\frac{\partial^2 \log f(x-t)}{\partial x \, \partial t} = -\frac{\partial^2 \log f(x-t)}{\partial x^2}$.
and expected belief of the marginal agent, and often, who is marginal is difficult to surmise.
Fortunately, we will be able to proxy for our shifters with the baseline take-up rate, which
is given by the following two expressions, related through integration by parts:
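\[
\begin{aligned}
\beta(\lambda_t, \lambda_m) &= \int T(\mu \mid \lambda_t) \cdot m(\mu \mid \lambda_m) \, d\mu, \\
&= \int t(\theta \mid \lambda_t) \cdot \left[ 1 - M(\theta \mid \lambda_m) \right] d\theta.
\end{aligned}
\]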
Intuitively, the first expression comes from the fact that for any belief µ, the fraction of
agents whose threshold is lower is T (µ |λt). The second expression has similar intuition:
for any threshold θ, the fraction of agents whose belief is higher is 1 − M (θ |λm).
The baseline is able to act as a proxy for the shifters because, for fixed λm, the baseline runs
monotonically from 0 to 1 as λt decreases (and for fixed λt, it runs monotonically from 0 to
1 as λm increases).21 Hence, when only λm or only λt vary, there is a one-to-one mapping
between shifter and baseline, which means we can restate our comparative statics in terms
of the treatment effect as a function of the baseline take-up rate, τ (β).
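The equivalence of the two baseline expressions, and the monotonicity of the baseline in the threshold shifter, can be illustrated numerically. The sketch below assumes, purely for illustration, that beliefs and thresholds are normally distributed and that the shifters act as translations of the means. The function names, grid bounds, and parameter values are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def trapezoid(f, lo, hi, n=4000):
    """Composite trapezoid rule for the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h


def baseline_over_beliefs(thresholds, beliefs, lo=-15.0, hi=15.0):
    """beta = integral of T(mu) * m(mu) d mu (first expression)."""
    return trapezoid(lambda mu: thresholds.cdf(mu) * beliefs.pdf(mu), lo, hi)


def baseline_over_thresholds(thresholds, beliefs, lo=-15.0, hi=15.0):
    """beta = integral of t(theta) * [1 - M(theta)] d theta (second expression)."""
    return trapezoid(
        lambda th: thresholds.pdf(th) * (1.0 - beliefs.cdf(th)), lo, hi)


# Hypothetical distributions: beliefs ~ N(0, 1), thresholds ~ N(0.5, 1.5^2).
beliefs = NormalDist(0.0, 1.0)
thresholds = NormalDist(0.5, 1.5)

beta_first = baseline_over_beliefs(thresholds, beliefs)
beta_second = baseline_over_thresholds(thresholds, beliefs)

# With independent normals, take-up is P(theta < mu), which has a closed form.
beta_closed = NormalDist().cdf((0.0 - 0.5) / sqrt(1.0 ** 2 + 1.5 ** 2))

# Translating the threshold distribution upward lowers the baseline.
baselines = [baseline_over_beliefs(NormalDist(shift, 1.5), beliefs)
             for shift in (-2.0, -1.0, 0.0, 1.0, 2.0)]
```

In this example the two integrals agree with each other and with the closed form, and `baselines` is strictly decreasing in the threshold shift.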
Theorem 3. If changes in β are being driven by changes in the threshold shifter λt, then
the following are properties of τ (β):
• There exists a β0 such that τ (β) ≤ 0 when β ≤ β0, and τ (β) ≥ 0 for β ≥ β0.
• If both m (µ) and t (θ |λt) are log-concave, then the magnitude of τ (β) has exactly two
peaks: one in the negative treatment effect range and one in the positive treatment
effect range.
If changes in β are being driven by changes in the prior expectation shifter λm, then the
following are properties of τ (β):
• There exists a β0 such that τ (β) ≥ 0 when β ≤ β0, and τ (β) ≤ 0 for β ≥ β0.
• If both m (µ |λm) and t (θ) are log-concave, then the magnitude of τ (β) has exactly
two peaks: one in the negative treatment effect range and one in the positive treatment
effect range.
Either way, $\lim_{\beta \to 0} \tau(\beta) = \lim_{\beta \to 1} \tau(\beta) = 0$.
This theorem provides a natural way to gain some simple intuition into when we expect
the treatment effect to be large or small. The general shape of the curve suggested by the
theorem can be seen in Figure 1.
21See Lemma 9 in the appendix for a formal proof.
Figure 1: τ (β) when changes in β are driven by changes in λt.
4 Making the comparative statics more precise
Often, in addition to simple comparative statics, we might want to have some idea of the
baseline take-up rate at which τ (β) goes from negative to positive, or the point at which a
positive effect goes from increasing to decreasing. To get precise answers to such questions,
we must make more specific functional form assumptions, and supplement them with belief
surveys. In this section, we will discuss how to model the treatment effect with belief surveys of
different coarseness of information.
4.1 The normal model
Assume that µ and θ are independent and normally distributed with means E [µ] and E [θ]
and variances Var [µ] and Var [θ]. Further assume that changes in β are driven by translations
in the threshold distribution (i.e. the distribution remains normal with the same variance
but a di�erent mean). We call this setup the normal model.
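Theorem 4. In the normal model, the treatment effect as a function of the baseline is given
by
\[
\tau(\beta) = \varepsilon^2 \cdot E[\sigma^2] \cdot \frac{1}{\sqrt{1 + \eta^2}} \cdot \varphi\left( \Phi^{-1}(\beta) \right) \cdot \left\{ z(\nu) + \frac{1}{\sqrt{1 + \eta^2}} \cdot \Phi^{-1}(\beta) \right\},
\]
where $z(\nu) \equiv \frac{\nu - E[\mu]}{\sqrt{\mathrm{Var}[\mu]}}$ is the z-score of the signal on the distribution of µ, and $\eta \equiv \sqrt{\mathrm{Var}[\theta] / \mathrm{Var}[\mu]}$
is the ratio of the standard deviations of θ and µ. Further, since distributions in the normal
model are log-concave, Theorem 3 comes to bear, which means that we expect one zero, one
minimum, and one maximum. The zero of τ (β) is at
\[
\beta_0 = \Phi\left( -\sqrt{1 + \eta^2} \cdot z(\nu) \right),
\]
while the minimum β− and the maximum β+ of τ (β) are at
\[
\beta_{\pm} = \Phi\left( \tfrac{1}{2} \cdot \Phi^{-1}(\beta_0) \pm \sqrt{ \left( \tfrac{1}{2} \cdot \Phi^{-1}(\beta_0) \right)^2 + 1 } \right).
\]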
See the appendix for a derivation. This functional form is actually what was plotted in
Figure 1. In many situations, η is an unknown parameter. Fortunately, our formulae still
give us bounds on where the extrema are located, and the ranges on which τ (β) is
monotonic. In the main text, we will present bounds that only require knowing whether the
signal is “good news” or “bad news” relative to the population distribution of beliefs. In the
appendix, we derive similar bounds for a known value of z (ν).
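The formulas of Theorem 4 are straightforward to evaluate. The sketch below transcribes them using only the Python standard library. The function and argument names (`tau_normal`, `z_nu`, `eta`) are hypothetical, and the default ε² = E[σ²] = 1 is an arbitrary normalization that affects the scale of τ but not the location of its zero or extrema.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal: pdf is phi, cdf is Phi


def tau_normal(beta, z_nu, eta, eps2=1.0, mean_sigma2=1.0):
    """Treatment effect at baseline beta in the normal model (Theorem 4)."""
    q = STD.inv_cdf(beta)               # Phi^{-1}(beta), requires 0 < beta < 1
    k = 1.0 / sqrt(1.0 + eta ** 2)
    return eps2 * mean_sigma2 * k * STD.pdf(q) * (z_nu + k * q)


def zero_of_tau(z_nu, eta):
    """beta_0: the baseline at which the treatment effect changes sign."""
    return STD.cdf(-sqrt(1.0 + eta ** 2) * z_nu)


def extrema_of_tau(z_nu, eta):
    """Return (beta_minus, beta_plus), the minimizer and maximizer of tau."""
    # Half of Phi^{-1}(beta_0), computed directly to avoid a cdf round trip.
    a = -0.5 * sqrt(1.0 + eta ** 2) * z_nu
    root = sqrt(a * a + 1.0)
    return STD.cdf(a - root), STD.cdf(a + root)


# Hypothetical example: a signal one standard deviation above mean beliefs,
# with thresholds as dispersed as beliefs (eta = 1).
beta_zero = zero_of_tau(1.0, 1.0)
beta_minus, beta_plus = extrema_of_tau(1.0, 1.0)
```

For this good-news example, `beta_zero` lies below one half, the trough lies below Φ(−1), and the peak lies between one half and Φ(1).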
Proposition 4. In the normal model, when the signal is “good news”, that is, z(ν) > 0, we
know that β0 ∈ (0, 1/2), which means that
• β+ ∈ (1/2, Φ (1) ≈ 0.841) and β− ∈ (0, Φ (−1) ≈ 0.159),
• τ (β) is increasing on β ∈ (0.159, 1/2) and decreasing on β ∈ (0.841, 1).
Similarly, when the signal is “bad news”, that is, z(ν) < 0, we know that β0 ∈ (1/2, 1), which
means that
• β+ ∈ (0.841, 1) and β− ∈ (0.159, 1/2),
• τ (β) is decreasing on β ∈ (0, 0.159) and increasing on β ∈ (1/2, 0.841).
The proof hinges on the fact that the extrema β± are both monotonically increasing in the
location of the zero, β0.22 See the appendix for a proof. Note that this proposition gives very
stark results about whether τ (β) is increasing or decreasing on certain ranges. This can be
particularly useful as a guide for practitioners who are trying to determine whether to expect
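22To see this, note that the derivative of the extrema locations is
\[
\frac{\partial \beta_{\pm}}{\partial \beta_0} = \varphi\left( \tfrac{1}{2} \cdot \Phi^{-1}(\beta_0) \pm \sqrt{ \left( \tfrac{1}{2} \cdot \Phi^{-1}(\beta_0) \right)^2 + 1 } \right) \cdot \frac{1}{2} \cdot \left( 1 \pm \frac{ \tfrac{1}{2} \cdot \Phi^{-1}(\beta_0) }{ \sqrt{ \left( \tfrac{1}{2} \cdot \Phi^{-1}(\beta_0) \right)^2 + 1 } } \right) \cdot \frac{\partial \Phi^{-1}}{\partial \beta_0}(\beta_0).
\]
None of the terms in that product can be negative.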
larger or smaller treatment effects in subgroups that have different values for the baseline,
β.
For similar reasons, we might also care about what baselines should yield a larger treatment
effect than a given reference baseline. Define the upper contour set of the treatment
effect as $\Upsilon(\beta \mid \eta) = \{ \tilde{\beta} : \tau(\tilde{\beta} \mid \eta) \geq \tau(\beta \mid \eta) \}$ and the lower contour set as
$\Lambda(\beta \mid \eta) = \{ \tilde{\beta} : \tau(\tilde{\beta} \mid \eta) \leq \tau(\beta \mid \eta) \}$. These set-valued functions take a baseline β and return the set
of other baseline take-up rates at which we expect a treatment effect at least as big (for Υ)
or at least as small (for Λ) as it is at β. In the normal model, these functions can be written
as simple intervals since, as we saw in Theorem 3, τ (β) and −τ (β) are both single-peaked
on the ranges where they are positive. Further, we can derive inner bounds for these sets
when η is unknown.
Theorem 5. Define b (η) implicitly as the solution to τ (b (η) | η) = τ (β | η) that isn't β.
Then:
• if the signal is good news (i.e. z (ν) > 0), and the treatment effect is positive and
decreasing at baseline take-up rate β, then Υ (β | η) = [b (η) , β], where b (η) is decreasing
in η for η ≥ 0 and hence Υ (β | 0) is an inner bound, that is Υ (β | 0) ⊆ Υ (β | η).
• if the signal is bad news (i.e. z (ν) < 0), and the treatment effect is negative and
decreasing at baseline take-up rate β, then Λ (β | η) = [β, b (η)], where b (η) is increasing
in η for η ≥ 0 and hence Λ (β | 0) is an inner bound, that is Λ (β | 0) ⊆ Λ (β | η).
When η is unknown, Theorem 5 provides guidance about when a treatment effect will be
larger in magnitude. A conservative estimate can be obtained by setting η to be at the low
end of what is expected (see Coffman, Featherstone and Kessler (2014) for an application of
this approach with η = 1). Absent any information, η = 0 is the most conservative bound.
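The inner bound of Theorem 5 can be illustrated for the good-news case by solving for b (η) numerically. The sketch below uses the functional form of Theorem 4 and a plain bisection between the zero β0 and the peak β+, where τ is increasing. The function names and example values are hypothetical, and the bad-news case is deliberately left out.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()


def tau_shape(beta, z_nu, eta):
    """tau(beta) from Theorem 4 with eps^2 * E[sigma^2] normalized to one."""
    q = STD.inv_cdf(beta)
    k = 1.0 / sqrt(1.0 + eta ** 2)
    return k * STD.pdf(q) * (z_nu + k * q)


def other_baseline(beta, z_nu, eta, iterations=200):
    """b(eta): the baseline other than beta with the same treatment effect.

    Covers only the good-news case (z_nu > 0) with beta to the right of the
    peak, so that the treatment effect is positive and decreasing at beta.
    """
    if z_nu <= 0:
        raise ValueError("this sketch covers the good-news case only")
    k = 1.0 / sqrt(1.0 + eta ** 2)
    u_zero = -z_nu / k                                   # Phi^{-1}(beta_0)
    u_peak = 0.5 * u_zero + sqrt((0.5 * u_zero) ** 2 + 1.0)  # Phi^{-1}(beta_+)
    u_beta = STD.inv_cdf(beta)
    if u_beta <= u_peak:
        raise ValueError("beta must lie to the right of the peak beta_plus")

    def g(u):
        # Same as tau_shape, but written in terms of u = Phi^{-1}(beta).
        return k * STD.pdf(u) * (z_nu + k * u)

    target = g(u_beta)
    lo, hi = u_zero, u_peak      # g rises from 0 to its maximum on [lo, hi]
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return STD.cdf(0.5 * (lo + hi))


# Hypothetical example: z(nu) = 1 and a reference baseline of 0.9.
b_eta0 = other_baseline(0.9, 1.0, 0.0)
b_eta1 = other_baseline(0.9, 1.0, 1.0)
```

In this example b (η) is smaller at η = 1 than at η = 0, so the interval [b (0), β] sits inside [b (1), β], as the theorem states.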
4.2 Belief and threshold surveys
If we have a belief survey that gives us m (µ), then we can simply guess some family for
t (θ |λt). From there, we can numerically integrate both the baseline take-up rate and
the treatment effect as a function of λt. If t (θ |λt) is ordered in the monotone likelihood
ratio sense, then λt simply serves to parametrize the relationship τ (β), since
$\beta(\lambda_t) = \int \left[ 1 - M(\theta) \right] \cdot t(\theta \mid \lambda_t) \, d\theta$ is a monotonically decreasing function. Moreover,
if we have a survey that provides the threshold distribution, then we can numerically integrate
treatment and baseline take-up rate in the same way, except now we will only need to
make an assumption about how λt moves the empirical distribution of thresholds. Simple
translation seems like the most likely option.
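The procedure can be sketched as follows, under several assumptions that are not in the text: the survey is a plain list of elicited expectations µ, the guessed threshold family is normal and moves by translation, and the treatment effect takes the form τ ∝ E [t (µ |λt)] · (ν − E [µ | θ = µ]), which equals the average of t (µ |λt) · (ν − µ) over surveyed beliefs. The survey values, the function name, and the `scale` argument (standing in for ε² · E[σ²]) are hypothetical.

```python
from statistics import NormalDist


def baseline_and_effect(beliefs, nu, threshold_mean, threshold_sd=1.0,
                        scale=1.0):
    """Return (beta, tau) for one value of the threshold shifter.

    beliefs        : surveyed prior expectations mu_i (empirical m)
    nu             : signal realization
    threshold_mean : location of the guessed normal threshold family
    scale          : stands in for eps^2 * E[sigma^2] (assumed known)
    """
    thresholds = NormalDist(threshold_mean, threshold_sd)
    n = len(beliefs)
    # Baseline: share of agents whose threshold lies below their belief.
    beta = sum(thresholds.cdf(mu) for mu in beliefs) / n
    # Effect: density of marginal agents times how far the signal moves them.
    tau = scale * sum(thresholds.pdf(mu) * (nu - mu) for mu in beliefs) / n
    return beta, tau


# Hypothetical survey responses and signal, for illustration only.
survey = [-1.0, -0.5, 0.0, 0.2, 0.5, 1.0, 1.5]
signal = 0.5

# Sweep the shifter to trace out the curve tau(beta).
curve = [baseline_and_effect(survey, signal, shift)
         for shift in (-3.0, -1.5, 0.0, 1.5, 4.0)]
```

Each element of `curve` is a (β, τ) pair. Because the baseline falls monotonically as the thresholds translate upward, sorting the pairs by β gives the implied relationship between baseline and treatment effect.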
5 Interpreting the comparative statics
The comparative statics described in Sections 3 and 4 provide intuition about how we expect
treatment effects to vary across groups with differing baseline take-up rates. In this section,
we use this intuition to test our model with existing empirical results from the literature
on information nudges. (Note: Our current meta-analysis is limited to 62 data points from
13 papers, which we found in an initial search of the literature; we are in the process of
getting more data and will rerun the analysis upon collecting it all.) While the papers span
a variety of domains and interventions, the results are largely consistent with our theory.
Consequently, we are able to reconcile results across the literature on information nudges
and highlight why some attempts at affecting behavior with information fail to achieve the
desired goal.
5.1 Within and across experiments
Many empirical papers report treatment effects of their intervention across multiple
subgroups. Our theory makes sharp predictions about which subgroups will display larger
treatment effects. Across subgroups in the same experiment, the realization and information
content of the signal remain constant, so the relevant comparative statics are confidence,
summarized by E[σ²], threshold (e.g. quality of outside option), summarized by t (θ |λt),
and optimism, summarized by m (µ |λm). For the first, we would expect the treatment effect to
be independent of baseline take-up, since the baseline does not depend on E[σ²]. For the
second, we would expect a relation like that in Figure 1. For the third, we would expect
a similar relationship, but one that is positive and then negative as the baseline increases. In
practice, most subgrouping in empirical papers is done based on demographic characteristics.
Unless we have an ex ante reason to think that a demographic group would be particularly
optimistic or pessimistic about the value of take-up (e.g. if men and women were given
feedback about the likelihood of winning a tournament in which men's beliefs are overly
optimistic relative to those of women), it seems safe to assume that when different demographic
groups have different take-up rates, it is caused by differences in preferences for the action
relative to the outside option (e.g. because of different tastes for the action or different
outside options). (With more data, one could imagine only using within-experiment variation
to test the model. At the moment, we must also rely on across-experiment variation, which
requires additional assumptions described below.)
Across experiments, baseline take-up rates differ along with setting and associated nudge.
Consequently, when we think about moving from one setting (e.g. with a lower baseline) to
another setting (e.g. with a higher baseline) it is less clear whether we should think about
individuals in the latter group having worse outside options (i.e. lower thresholds) or being
more optimistic about the value of the action (i.e. higher beliefs). For our theory to make
predictions about how baseline correlates with treatment effects across experiments, however,
we have to take a stand on whether thresholds or beliefs are shifting across experiments.
Fortunately, the type of experimental settings researchers select can act as a guide. Because
the standard intuition is that information interventions work when nudges are higher than the
average beliefs, we expect most experiments that are actually undertaken to provide signals
that are relatively far to the right of the belief distribution. So while there is variation in
how far to the right the signal is relative to average beliefs, we can assume that, across
experiments, if the belief distribution moves to the right, so does the signal. Further, if
we think about the model thus far, beliefs, thresholds, and signals are only defined relative
to each other. That is, if we translate all of these quantities to the right, nothing changes.
Hence, broadly speaking, shifting the belief distribution and the signal in lock-step is roughly
equivalent to shifting the threshold distribution in the opposite direction. As such, even
across papers, we have reason to think that the threshold shifter comparative static is the
best guide, which allows us to use the baseline variation across experiments in the same
way we use it within experiments. In the next section, we will put this logic to the test by
plotting treatment effects against baseline take-up rates for all subgroups from a sample of
papers from the empirical nudge literature.
5.2 Empirical support
As a first check of the theory, we investigate if the functional relationship between baseline
rates and treatment effects is consistent with predictions from the theory. To simplify the
analysis, we assume that the papers in our data set nudge with information that is high
relative to the distribution of beliefs held by subjects. This is not a strong assumption, as
the papers in our analysis all intended to increase beliefs about the utility of the desired
outcome. Looking back to our theory, we predict that there will be weakly negative treatment
effects at very low baselines, and then an inverted U-shape of positive treatment effects at
other baselines.23 As we will discuss later, our current data is under-powered to test the
first part of this prediction. Consequently, in what follows we explore the second part of
this prediction by looking for an inverted U-shape with treatment effects increasing and then
decreasing with the baseline.
23Essentially, we expect a relation akin to that in Figure 1. To see this more formally, consider the normal model from Section 4.1. Theorem 4 shows that the larger the nudge is relative to the belief distribution of subjects (codified by a large z (ν)), the closer the internal zero of the normal model, β0, is to zero. A bit more algebra shows that as β0 → 0, the ratio of the depth of the trough of τ (β) to the height of the peak of τ (β) goes to zero.
This approach is not intended to be an iron-clad test of the theory, but rather a
demonstration that the theory is consistent with the literature and serves to organize it. Further,
the data to which we have access are not ideal. First, without precise measurement and
reporting of every parameter in the formulas from Theorem 2, we cannot make precise
predictions, e.g. the threshold for negative treatment effects, or the peak of the curve. Second,
and more importantly, the data set may be confounded by publication bias. Papers with an
insignificant or negative treatment effect may not ever be published, and as a result will not
make it into our analysis. This is problematic since our theory predicts bigger treatment
effects at intermediate baselines where variance of a binary variable is at its maximum: if
papers are only published with significant results, we might fail to see small treatment effects
at intermediate baselines. This concern is mitigated, however, since we have data reported
for secondary analyses (e.g. subgrouping or secondary outcomes) from published papers,
which are likely less prone to publication bias.
As dictated by our model, we analyze papers that satisfy three criteria: information
was experimentally provided, there was a binary outcome variable, and base rates in the
(no-information) control condition were reported. In all, we have 62 data points from 13
different papers. We do not claim the papers collected are exhaustive. However, we have no
reason to believe these papers are not representative of the relationship between treatment
effects and baseline rates. For visual simplicity, we exclude three treatment effects with an
absolute value greater than 15 percentage points, leaving 59 data points to analyze.24
The average treatment effect in the sample is +3.0 percentage points (pp) with a median
of +2.7pp. The treatment effects vary widely, with a standard deviation of 3.1pp. However,
negative treatment effects comprise only eight of the 59 treatment effects. Seven of those
negative treatment effects are fairly insubstantial, with an absolute value less than two
percentage points, and six of the eight are within a standard error of zero. In short, the current
analysis will be underpowered to test which base rates, if any, produce negative treatment
effects.
Baseline take-up rates are skewed towards the low side of the unit interval, with a median
of 30% and an average of 37%. This may be by chance. It may also be that researchers
choose to experiment on low take-up rates, perhaps because this indicates a problem exists,
and there is much potential for improvement. Though this intuition is appealing, the lower
end of the unit interval is precisely where our theory is most pessimistic for substantial,
positive treatment effects, as shown in Figure 1.
24This restriction is mostly to avoid the noise introduced by large outliers, although some sort of restriction of this form is supported by the theory, which is only valid for “nudges”, implying modest changes in beliefs and behaviors. Results are qualitatively consistent when these three data points are included.
[Figure 2 (scatter plot). Vertical axis: Size of Treatment Effect, from −0.05 to 0.15. Horizontal axis: Take-up Rate in Control Group, from 0 to 1. Papers shown: Allcott & Taubinsky (2015); Avitabile & de Hoyos (2015); Bettinger et al (2011); Cai et al (2009); Clark et al (2013); Coffman et al (2015); Del Carpio (2014); Hastings et al (2015); Hastings & Weinstein (2008); Jensen (2010); Kuziemko et al (2015); Karadja et al (2014); Nguyen (2008).]
Each dot represents the treatment effect for a group with a particular baseline take-up rate. Papers identified by color. Marker size is weighted by inverse standard error. To facilitate viewing, three treatment effects with absolute value greater than 0.15 are excluded from the analysis.
Figure 2: Treatment effect sizes (pp) by baseline take-up rates (%)
To give a raw look at the data, Figure 2 is a scatter plot of treatment effect sizes on the
vertical axis and baseline take-up rates along the horizontal axis. The size of the markers
is weighted by the inverse standard error of the treatment effect; larger dots represent more
precise estimates. Visually, the data roughly form the inverted U-shape predicted by the
theory. Going left to right, treatment effects start out quite small, then generally increase,
then decrease.
Regression analysis supports the visual. We run ordinary least squares regressions to find
the best-fit quadratic,
\[ \tau = a + b_1 \cdot \beta + b_2 \cdot \beta^2 + \text{error}, \]
weighting observations by the inverse of the squared standard error of the treatment effect
(Model I) or clustering at the paper level (Model II). Table 1 displays the results. The
coefficient b_1 is positive and significant, b_2 is negative and significant, and a is
indistinguishable from zero in both models. The estimated relationship is an inverted U-shape:
as base rates increase, treatment effect sizes increase and then decrease. This can be seen
visually in Figure 3, which shows a smoothed local polynomial regression of the data.

D.V. = Size of Treatment Effect
                               I           II          III               IV
Data restriction               None        None        β ∈ [0.159, 0.5]  β ∈ [0.159, 0.5]
β (baseline take-up rate)      0.23        0.23        0.13              0.13
                               (0.03)***   (0.04)***   (0.04)***         (0.04)***
β²                             -0.26       -0.26
                               (0.04)***   (0.04)**
Constant                       -0.00       -0.00       0.00              0.00
                               (0.01)      (0.01)      (0.01)            (0.04)
Inverse variance weighting?    Yes         No          Yes               No
Clustered by paper?            No          Yes         No                Yes
Num. Results                   59          59          34                34
Num. Papers                    13          13          9                 9
*, **, *** denotes significance at 0.1, 0.05, and 0.01, respectively.
Both baseline and treatment effect are measured in percentage points. For instance, the 0.23 coefficient for β under Model I should be interpreted as a 0.23 percentage point change in treatment effect for a 1 percentage point change in baseline.
Table 1: The relationship between baseline and treatment effect.
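As a rough illustration of the inverse-variance-weighted quadratic fit described above, the sketch below solves the weighted normal equations exactly with rational arithmetic. The data are hypothetical and noiseless (they are not the meta-analysis sample), and the function name `weighted_quadratic_fit` is our own; in applied work a statistics package would normally be used instead.

```python
from fractions import Fraction


def weighted_quadratic_fit(betas, taus, ses):
    """Return (a, b1, b2) minimizing sum_i (tau_i - a - b1*beta_i - b2*beta_i**2)**2 / se_i**2."""
    rows = [[Fraction(1), b, b * b] for b in betas]
    weights = [1 / (s * s) for s in ses]  # inverse of the squared standard error
    # Augmented normal equations: (X'WX | X'W tau)
    A = [
        [sum(w * r[i] * r[j] for w, r in zip(weights, rows)) for j in range(3)]
        + [sum(w * r[i] * t for w, r, t in zip(weights, rows, taus))]
        for i in range(3)
    ]
    # Gauss-Jordan elimination (X'WX is positive definite, so pivots are nonzero)
    for i in range(3):
        pivot = A[i][i]
        A[i] = [v / pivot for v in A[i]]
        for k in range(3):
            if k != i:
                factor = A[k][i]
                A[k] = [vk - factor * vi for vk, vi in zip(A[k], A[i])]
    return tuple(A[i][3] for i in range(3))


# Hypothetical, noiseless inverted-U data: tau = 0.2*beta - 0.2*beta**2
betas = [Fraction(k, 10) for k in range(1, 10)]
taus = [Fraction(1, 5) * b - Fraction(1, 5) * b * b for b in betas]
ses = [Fraction(1 + k % 3, 100) for k in range(1, 10)]  # made-up standard errors

a, b1, b2 = weighted_quadratic_fit(betas, taus, ses)
peak = -b1 / (2 * b2)  # baseline rate at which the fitted effect is largest
print(a, b1, b2, peak)
```

With b_1 > 0 and b_2 < 0, the fitted curve peaks at β = −b_1 / (2 · b_2), which is one way to summarize where an inverted U turns over.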
Further, the models are in line with the more precise predictions made in Proposition 4,
which states that if the information provided is "good news", i.e. it is greater than the mean
belief of the population, then in the normal model, the treatment effect is increasing with
baseline take-up rates on the range β ∈ (0.159, 0.5) and decreasing with baseline take-up
rates on the range β ∈ (0.841, 1). We only have one data point in (0.841, 1), so we focus on
the first interval. Indeed, as can be seen in Figure 3, treatment effects are increasing over
(0.159, 0.5). Further, Models III and IV of Table 1 estimate a linear fit of the data from that
range and find a significant and substantial increase in treatment effects with respect
to baseline take-up rates. Results suggest a 1.2pp increase in treatment effect for every ten
point increase in baseline rate over that range (p < 0.01 in both regressions).
Overall, the data provide strong support for the theory. Due to current data limitations,
we are not able to test what, if any, baseline take-up rates predict negative treatment effects.
However, the data show strong evidence for the inverted-U relationship between baseline
take-up rates and treatment effects. Empirically, the intuition that low baseline take-up
rates may be an ideal environment for experimenting might be outweighed by the factors
identified in our model.
[Figure 3: smoothed local polynomial regression with confidence bands. Vertical axis: Size of Treatment Effect. Horizontal axis: Take-up Rate in Control Group (0 to 1).]
The confidence bands are drawn at the 95% confidence level.
Figure 3: Estimated Relationship between Baseline Take-up and Treatment Effect
6 Extensions to the model
Although the paper has thus far focused on the effect of revealing information to
near-risk-neutral agents making a single binary choice, our framework can be applied to a variety of
related settings. In this section, we elucidate five such extensions, focusing on additions to
the model that we think are particularly relevant both for practitioners and for academics
interested in understanding how nudges can affect behavior and, as in subsection 6.1,
how such nudges affect welfare, a topic of growing interest (see, e.g., Allcott and Kessler
(2015)).
6.1 Welfare
Our basic model allows us to compute the treatment effect of the information nudge on
welfare, τ_W. To compute τ_W, note that only agents whose behavior is changed by the
nudge contribute, that is, agents with thresholds in the range θ ∈ [µ, µ + σ² · (ν − µ) · ε²].
Of those agents with some prior expectation µ, the fraction that are persuaded is hence
t(µ) · σ² · (ν − µ) · ε². Each of those agents ultimately prefers take-up to abstention by
κ · (x − θ), where κ ≡ u′(E[µ]).²⁵ So, for a particular value of µ, the expected welfare
change is t(µ) · σ² · (ν − µ) · ε² · κ · (x − θ). Note that while behavior is determined by ex ante
expectations about x, welfare is determined by the value of x that is ultimately realized.
The overall effect on welfare can be derived by taking the expectation of our expression over
µ.
Theorem 6. For a given realization of x, the ex post treatment effect of the information
nudge ν on welfare is
\[ \tau_W(\nu) = \varepsilon^2 \cdot E[\kappa] \cdot E[\sigma^2] \cdot E[t(\mu)] \cdot \left(x - E[\mu \mid \theta = \mu]\right) \cdot \left(\nu - E[\mu \mid \theta = \mu]\right), \]
which is negative when the signal ν and the actual information x are on different sides of the
expected belief of the marginal agent, E[µ | θ = µ]. However, the ex ante (i.e. x is known,
but not ν) treatment effect of the nudge on welfare is positive and equal to
\[ E\left[\tau_W(\nu)\right] = \varepsilon^2 \cdot E[\kappa] \cdot E[\sigma^2] \cdot E[t(\mu)] \cdot \left(x - E[\mu \mid \theta = \mu]\right)^2. \]
The two parts of the theorem illustrate the subtlety of the intuition that providing
information to agents can only help decision making and thus cannot make agents worse off.
In expectation, a signal of x helps; however, there exist realizations of the signal that hurt
agents overall.
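The contrast between the two parts of Theorem 6 can be sketched numerically. In the snippet below, `scale` stands in for the common factor ε² · E[κ] · E[σ²] · E[t(µ)], `m` for the marginal agent's expected belief E[µ | θ = µ], and the four equally likely signals are hypothetical values chosen to average to x (we assume, for illustration only, that the signal is unbiased for x).

```python
from statistics import mean


def ex_post_welfare_effect(nu, x, m, scale=1.0):
    """First part of Theorem 6: scale * (x - m) * (nu - m)."""
    return scale * (x - m) * (nu - m)


x, m = 1.0, 0.0  # realized value and marginal agent's expected belief (hypothetical)
signals = [x - 1.5, x - 0.5, x + 0.5, x + 1.5]  # symmetric about x, so they average to x

effects = [ex_post_welfare_effect(nu, x, m) for nu in signals]
ex_ante = mean(effects)

print(effects)   # the first signal lies on the other side of m from x
print(ex_ante)   # equals scale * (x - m)**2
```

The lowest signal falls below m while x lies above it, so that realization lowers welfare, even though the average over signals matches the squared-gap expression in the second part of the theorem.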
6.2 Attrition
In many binary choice situations, there is a time between the original choice and the
realization of x in which takers can renege or attrit.²⁶ One reason for takers to attrit is the arrival
of new alternatives that are preferred to taking-up. In this sort of scenario, we should worry
not just about takers who were marginal about taking-up originally, but also about takers
who were infra-marginal at the time of original decision, but who might become marginal as
a new opportunity arrives.
To model this, assume that after taking-up, takers receive a new outside option, codified
by a new threshold θ_N, distributed according to the density t_N(θ_N). This new outside option
will cause a taker to renege if it is greater than his belief, µ. So, our agents who attrit have
θ < µ < θ_N. The number of such agents is β − ∫ T(µ) · T_N(µ) · m(µ) dµ, and hence the
attrition rate is
\[ 1 - \frac{\int T(\mu) \cdot T_N(\mu) \cdot m(\mu) \, d\mu}{\beta}. \]
Now, treatment will prevent attrition in agents with
θ < θ_N and θ_N ∈ (µ, µ + σ² · (ν − µ) · ε²). As before, we can multiply the width of this range,
σ² · (ν − µ) · ε², by the density of agents on this range, yielding T(µ) · t_N(µ) · σ² · (ν − µ) · ε².
Taking the expectation across the population, we can get the effect of the nudge on the
attrition rate.

25 Intuitively, an agent's κ is the utility change she experiences due to a unit increase in x.
26 For example, Coffman, Featherstone and Kessler (2014) investigate the effect of an information nudge that tells newly admitted applicants of Teach For America about the program's matriculation rate in the previous year. In this setting, potential teachers say yes to the program months before they actually start teaching (and often also many months before they start training) and so practitioners are concerned that potential teachers might say yes and then subsequently decide they would rather work elsewhere.
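Theorem 7. The effect of the nudge on the attrition rate is
\[ -E[\sigma^2] \cdot E\left[t_N(\mu) \mid \mu \ge \theta\right] \cdot \left(\nu - E\left[\mu \mid \theta \le \mu = \theta_N\right]\right) \cdot \varepsilon^2. \]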
Note that a positive treatment effect on taking-up does not guarantee a positive
treatment effect on attrition rate. For instance, if the new alternative offer is just another draw
from the original threshold distribution, then it is simple to show that
E[µ | θ ≤ µ = θ_N] ≥ E[µ | θ = µ].²⁷ In this case, it is possible to have a nudge that induces a positive effect on
take-up, but a negative effect on attrition rate. Hence for signals that just barely induce a
positive effect on take-up, it is sensible for the policy-maker to at least be cognizant of the
possibility that the intervention might induce a higher attrition rate. The flip side of this is
that for high enough signals, the nudge induces more take-up, through its action on marginal
takers, and less attrition, through its effect on infra-marginal takers.
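To make this concern more concrete, we can derive the treatment effect on attrition in
the normal model with the new alternative offer being just another draw from the original
threshold distribution.
Proposition 5. In the normal model where t_N(θ) = t(θ), the treatment effect on attrition is
\[
\tau_A = -\varepsilon^2 \cdot E[\sigma^2] \cdot \frac{1}{\sqrt{1+\eta^2}} \cdot \varphi\left(\Phi^{-1}(\beta)\right) \cdot \Phi\left(\frac{\eta \cdot \Phi^{-1}(\beta)}{\sqrt{(1+\eta^2) \cdot (2+\eta^2)}}\right)
\cdot \left[ z(\nu) + \frac{1}{\sqrt{1+\eta^2}} \cdot \Phi^{-1}(\beta) - \frac{\eta}{\sqrt{(1+\eta^2) \cdot (2+\eta^2)}} \cdot \frac{\varphi\left(\frac{\eta}{\sqrt{2+\eta^2}} \cdot \Phi^{-1}(\beta)\right)}{\Phi\left(\frac{\eta}{\sqrt{(1+\eta^2) \cdot (2+\eta^2)}} \cdot \Phi^{-1}(\beta)\right)} \right],
\]
where z(ν) and η are defined as in Theorem 4.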
Consider η = 1 and β = 50%. According to Theorem 4, any z(ν) ≥ 0 will induce a positive
treatment effect on baseline take-up. But Proposition 5 tells us that the nudge lowers the
attrition rate only if z(ν) ≥ 0.33. Any z(ν) between 0 and 0.33 will induce a
higher baseline take-up rate, but also a higher attrition rate.

27 To show this, we simply look at the likelihood ratio: m(µ | θ ≤ µ = θ_N) / m(µ | θ = µ) ∝ [m(µ) · t_N(µ) · T(µ)] / [m(µ) · t(µ)], where the constant of proportionality does not depend on µ. When t_N(µ) = t(µ), we get that the likelihood ratio is proportional to T(µ), which is clearly increasing. From there, the monotone likelihood ratio implies that m(µ | θ ≤ µ = θ_N) first-order stochastically dominates m(µ | θ = µ).
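The 0.33 cutoff in this example can be reproduced from the bracketed term of Proposition 5, which changes sign where z(ν) equals the remaining two terms with their signs flipped. The sketch below assumes our reading of that bracketed term from the garbled source; the function name `attrition_break_even_z` is ours, not the paper's.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: pdf is phi, cdf is Phi


def attrition_break_even_z(eta, beta):
    """Value of z(nu) at which the bracketed term in Proposition 5 equals zero."""
    q = STD_NORMAL.inv_cdf(beta)  # Phi^{-1}(beta)
    root = sqrt((1 + eta**2) * (2 + eta**2))
    ratio = STD_NORMAL.pdf(eta / sqrt(2 + eta**2) * q) / STD_NORMAL.cdf(eta / root * q)
    return -q / sqrt(1 + eta**2) + eta / root * ratio


z_star = attrition_break_even_z(eta=1.0, beta=0.5)
print(round(z_star, 2))
```

At η = 1 and β = 0.5 the inverse-CDF terms vanish, so the cutoff reduces to (1/√6) · φ(0)/Φ(0), which rounds to the 0.33 quoted in the text.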
6.3 Nudges that reassure
In the basic model, we allowed the nudge signal to be arbitrarily far from agents' prior
expectations, that is, we didn't assume that ν − µ was small. This led to our theory being
dominated by how the nudge directly shifted the expectation. In this section, we will consider
what other forces come into play when ν − µ is small as well. Ultimately, this alternative
modeling assumption demonstrates the potential value of a "reassuring nudge" that simply
increases an agent's confidence in their prior beliefs, something like "most people have a
more accurate guess at x than they think."²⁸
Mathematically, let ν − µ be of the same order as α_1, γ_1, γ_2, and ε². If this is the case,
then we can essentially follow the same logic as before (cf. the end of Section 2.3 and the
beginning of Section 2.5) to derive an expression for the treatment effect under our new
assumptions.
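Theorem 8. When ν − µ is of the same order as α_1, γ_1, γ_2, and ε², to leading order, the
treatment effect is given by
\[ \tau = E[t(\mu)] \cdot \left\{ E[\sigma^2] \cdot \left(\nu - E[\mu \mid \theta = \mu]\right) - \frac{1}{2} \cdot E[\sigma^3] \cdot E[\gamma_1] + \frac{1}{2} \cdot E[\sigma^4] \cdot E[\alpha_1] \right\} \cdot \varepsilon^2. \]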
Essentially, we now have two "extra" forces driving the treatment effect, in addition to the
one discussed in Section 2.5. The force proportional to α_1 comes from the fact that the
nudge decreases the second moment of the posterior, which leads to a risk-averse agent
having a higher expected utility from taking-up. This force always leads to a bump in
take-up, so long as agents are risk-averse. The force proportional to γ_1 comes from the fact
that decreasing the second moment, in addition to affecting utility through risk aversion,
also means that tail events are less important. If the agent is worried about left-tail events
(γ_1 < 0), then the nudge reassures her that those events are unlikely, leading to a bump in
take-up. Conversely, if the agent is banking on right-tail events (γ_1 > 0), the nudge convinces
her that such occurrences are unlikely, leading to a drop in take-up.

28 We do not know of any examples of this type of nudge having been analyzed in the literature, but we see reassuring motives in campaigns designed to encourage people who think they might have observed suspicious behavior to "see something, say something" and in perennial appeals by teachers to students that "if you have a question, you're not the only one."
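To see how the three forces in Theorem 8 can offset one another, the following sketch evaluates the leading-order expression for a pure "reassuring" nudge, one whose signal equals the marginal agent's expected belief so that the direct term vanishes. All parameter values are hypothetical, and the argument names are ours.

```python
def reassuring_treatment_effect(t_bar, s2, s3, s4, gap, gamma1, alpha1, eps2):
    """Leading-order effect from Theorem 8.

    t_bar : E[t(mu)];  s2, s3, s4 : E[sigma^2], E[sigma^3], E[sigma^4]
    gap   : nu - E[mu | theta = mu];  gamma1, alpha1 : E[gamma_1], E[alpha_1]
    """
    direct = s2 * gap               # shift in the posterior expectation
    tail = -0.5 * s3 * gamma1       # skewness (tail-event) force
    risk = 0.5 * s4 * alpha1        # risk-aversion force
    return t_bar * (direct + tail + risk) * eps2


common = dict(t_bar=1.0, s2=1.0, s3=1.0, s4=1.0, gap=0.0, alpha1=0.2, eps2=0.01)

left_tail_worry = reassuring_treatment_effect(gamma1=-0.5, **common)
right_tail_hope = reassuring_treatment_effect(gamma1=0.5, **common)
print(left_tail_worry, right_tail_hope)
```

With these made-up numbers, reassurance raises take-up when agents fear left-tail events, but lowers it when the skewness force from right-tail hopes outweighs the risk-aversion force.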
6.4 Continuous choices
While this paper has focused on binary choices, in many relevant settings where information
nudges are employed, agents have the opportunity to make continuous choices. We can use
the same framework developed in this paper to analyze how information nudges affect these
continuous choices.
Instead of a binary choice, now let agents choose an action (a continuous variable a) to
maximize the expectation of a utility function that also depends on x. Expanding utility in
a Taylor series about (a_0, E[µ]), where a_0 is the action that would be taken by an agent with
µ = E[µ], we get
\[
\begin{aligned}
u(a, x) \approx{} & u(a_0, E[\mu]) + u_1(a_0, E[\mu]) \cdot (a - a_0) + u_2(a_0, E[\mu]) \cdot (x - E[\mu]) \\
& + \frac{1}{2} \cdot u_{11}(a_0, E[\mu]) \cdot (a - a_0)^2 + u_{12}(a_0, E[\mu]) \cdot (x - E[\mu]) \cdot (a - a_0) \\
& + \frac{1}{2} \cdot u_{22}(a_0, E[\mu]) \cdot (x - E[\mu])^2 .
\end{aligned}
\]
If we assume that the prior is near-normal and that further terms in the expansion are O(ε²)
small, then, to leading order, the agent maximizing E[u(a, x) | ν] is solving
\[
\max_a \left\{ u_1(a_0, E[\mu]) \cdot (a - a_0) + \frac{1}{2} \cdot u_{11}(a_0, E[\mu]) \cdot (a - a_0)^2
+ u_{12}(a_0, E[\mu]) \cdot \left[ \mu - E[\mu] + \sigma^2 \cdot (\nu - \mu) \cdot \varepsilon^2 \right] \cdot (a - a_0) \right\}.
\]
In this scenario, the treatment effect is simple to derive. We define the responsiveness by
\[ \rho \equiv -\frac{u_{12}(a_0, E[\mu])}{u_{11}(a_0, E[\mu])}, \]
and by the same logic that supports the common information acquisition
assumption, we assume that ρ is independent of the other population parameters.
Theorem 9. The average change in action (i.e. the treatment effect) is
\[ \tau = E[\rho] \cdot E[\sigma^2] \cdot (\nu - E[\mu]) \cdot \varepsilon^2. \]
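The step from the quadratic problem to Theorem 9 can be checked on a toy population. The sketch below assumes that u_1(a_0, E[µ]) = 0 (our reading of the first-order condition that defines a_0), so the optimal action is a_0 + ρ · [µ − E[µ] + σ² · (ν − µ) · ε²]. Independence of ρ, µ, and σ² is imposed exactly by using a full grid of hypothetical parameter values.

```python
from itertools import product
from statistics import mean


def action_shift(rho, mu, mu_bar, sigma2, nu, eps2, treated):
    """Leading-order a* - a0 from the quadratic problem, assuming u1(a0, E[mu]) = 0."""
    belief_shift = sigma2 * (nu - mu) * eps2 if treated else 0.0
    return rho * (mu - mu_bar + belief_shift)


# Hypothetical parameter values; the full grid makes rho, mu, sigma^2 independent
rhos = [0.5, 1.0, 1.5]
mus = [-1.0, 0.0, 2.0]
sigma2s = [0.5, 2.0]
nu, eps2 = 1.0, 0.01
mu_bar = mean(mus)

tau_direct = mean(
    action_shift(r, m, mu_bar, s, nu, eps2, True)
    - action_shift(r, m, mu_bar, s, nu, eps2, False)
    for r, m, s in product(rhos, mus, sigma2s)
)
tau_formula = mean(rhos) * mean(sigma2s) * (nu - mu_bar) * eps2  # Theorem 9
print(tau_direct, tau_formula)
```

Because every agent adjusts, the population mean E[µ] appears in the formula rather than the marginal agent's belief.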
Note that for continuous choices, the signal at which the effect goes from negative to positive
is the average prior expectation across the entire population. Contrast this with the binary
choice model, where the switching point is the average prior expectation of the marginal
agent. The difference is intuitive: for continuous choice, we care about the change in utility
for all agents, because it leads to a change in behavior for all agents. For binary choice,
although the utility of all agents is changed, we only care about the change for the smaller
set of agents for whom it will change their behavior, that is, we only care about the utility
change for marginal agents. To further emphasize the difference, note that the computation
done in this section would look almost identical to trying to compute the average change in
utility in the binary choice model, that is, in the binary choice model, the average change in
utility switches signs as ν crosses E[µ], while the treatment effect on take-up changes sign
as ν crosses E[µ | θ = µ].
6.5 Strategic signal revelation
Given that nudges can successfully influence behavior of agents, we now turn briefly to how
practitioners who have an incentive to get agents to take an action would optimally utilize
information nudges to achieve their goals.
We consider a model where the firm can choose to coarsen the information it reveals,
but there is no leeway for lying, similar to Kamenica and Gentzkow (2011). A revelation
strategy of this sort can be represented by a partition of the real line where, instead of
revealing the signal, the firm reveals which element of the partition the signal is in. We call
the strategy full revelation whenever each element of its partition is a singleton. First,
note that if the firm cannot commit to a revelation strategy, then the only equilibrium is full
disclosure. This is because whenever a signal is realized that is near the upper boundary of
a partition element, the firm would prefer to fully reveal the signal. But if agents know this,
then when the firm does not fully reveal, they infer that the signal is not near the top of
the partition element. The only revelation strategy that doesn't unravel in this way is full
revelation. But what if the firm can commit? This turns out not to help either.
Theorem 10. Even if the firm can commit to a revelation strategy, it cannot make the
treatment effect any higher than with full revelation.
Intuitively, to leading order, agents who know only that ν ∈ Π will act as if they received a
deterministic signal that maximizes the likelihood of the event ν ∈ Π. Further, any
disagreement about this signal between agents will be O(ε²), and hence can be ignored to leading
order, since it is already multiplied by ε² in the formula for τ anyway. So all agents act
as if they received the same signal. What's more, this is also what the principal expects ν
to be, conditional on ν ∈ Π, for the same reasons. So, in expectation, full revelation and
coarser revelation strategies yield the same treatment effect, to leading order. Notice how
this depends on the linearity of τ in ν, which is reminiscent of the results of Kamenica and
Gentzkow (2011), specifically Remark 1.
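The role of linearity can be illustrated with a small discrete example. In the sketch below, the five equally likely signal values, the two-cell partition, and the unit slope are all hypothetical, and each cell's "as-if" signal is taken to be the conditional mean of ν within the cell, which is the simplification suggested by the argument above.

```python
from statistics import mean


def treatment_effect(nu, marginal_belief=0.0, slope=1.0):
    """A treatment effect that is linear in the signal, as in the leading-order formula for tau."""
    return slope * (nu - marginal_belief)


signals = [-2.0, -1.0, 0.5, 1.0, 3.0]          # equally likely signal realizations
partition = [[-2.0, -1.0], [0.5, 1.0, 3.0]]    # a coarser revelation strategy

# Full revelation: agents see nu itself
full = mean(treatment_effect(nu) for nu in signals)

# Coarse revelation: agents act as if they saw the cell's conditional mean
coarse = sum(
    len(cell) / len(signals) * treatment_effect(mean(cell)) for cell in partition
)
print(full, coarse)
```

Both strategies give the same expected effect here; replacing `treatment_effect` with a function that is convex or concave in ν would break the equality, which is why the result leans on linearity.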
Two points immediately follow from the previous analysis. First, one could imagine that
some agents are "naive", that is, if the firm says "nothing" then they do not update their
beliefs. In this world, if the signal would induce a positive treatment effect, it is clear that
the firm does best to reveal it. If the signal would induce a negative treatment effect, then
the firm does best to say "nothing", as it keeps the marginal naive agents taking up, but
in expectation is no different from full revelation in terms of the treatment effect on the
sophisticated agents. Consequently, in these settings, firms would only utilize information
nudges that revealed good news about the desired action, a strategy that appears to be
utilized in practice.
Second, if the principal were not risk-neutral about the treatment effect he ultimately
gets, then a coarser revelation strategy provides insurance without costing anything in terms
of expected treatment effect. This reasoning could help to explain how firms deal with
exogenously provided revelation strategies. For instance, some cities require a restaurant to
publicly display its "letter grade" following health inspections. If there exists finer information
about the restaurant, a first intuition says that a restaurant with a "B" should want to report
that it actually got a "B+". Of course, in doing so, sophisticated customers realize that in
other months where it doesn't report "B+", it must have actually gotten a "B" or "B-". By
establishing a policy of never revealing more than the city forces it to, the restaurant insures
itself against a "B-" by not taking the added payoff of a "B+" when it can.
7 Conclusion
As nudges become more prominent in the academic literature and more common as a policy
tool, there is an enhanced interest in understanding why nudges work, when they will be
successful (see, e.g., Beshears et al. (2015b)), and their welfare implications (see, e.g., Allcott
and Kessler (2015)). Central to this exercise is developing models of these nudges that can
give insight into the underlying mechanisms that can guide us to answers of how nudges
might work and who will be persuaded by them.
In this paper, we introduce a theory of information nudges that allows for Bayesian
updating in a setting of binary choice. Our model highlights that in these settings, the
relevant question about the sign of the treatment effect is whether the information nudge
provides "good news" about taking the action to agents at the margin. This finding highlights
the potential value of eliciting beliefs from agents, particularly from those agents at the
margin. Our model additionally suggests, however, that the baseline take-up rate in the untreated
group can be a useful proxy for the beliefs of marginal agents. This allows researchers and
practitioners to infer the likely sign and magnitude of a treatment effect arising from an
information nudge even without information on beliefs. In a meta-analysis of academic
findings, we find that the relationship between treatment effect size and baseline take-up rate
matches the pattern predicted by the theory, allowing us to rationalize previously "puzzling"
results from the literature.
References
Allcott, Hunt. 2011. "Social norms and energy conservation." Journal of Public Economics, 95(9): 1082–1095.
Allcott, Hunt, and Dmitry Taubinsky. 2015. "Evaluating behaviorally-motivated policy: Experimental evidence from the lightbulb market." American Economic Review, forthcoming.
Allcott, Hunt, and Judd B. Kessler. 2015. "The Welfare Effects of Nudges: A Case Study of Energy Use Social Comparisons." NBER Working Paper 21671.
Allcott, Hunt, and Todd Rogers. 2015. "The Short-Run and Long-Run Effects of Behavioral Interventions: Experimental Evidence from Energy Conservation." American Economic Review.
Angeletos, George-Marios, and Alessandro Pavan. 2007. "Efficient use of information and social value of information." Econometrica, 75(4): 1103–1142.
Avitabile, Ciro, and Rafael E De Hoyos Navarro. 2015. "The Heterogeneous Effect of Information on Student Performance: Evidence from a Randomized Control Trial in Mexico." World Bank Policy Research Working Paper 7422.
Bagnoli, Mark, and Ted Bergstrom. 2005. "Log-concave probability and its applications." Economic Theory, 26(2): 445–469.
Beshears, John, James J Choi, David Laibson, Brigitte C Madrian, and Katherine L Milkman. 2015a. "The effect of providing peer information on retirement savings decisions." The Journal of Finance, 70(3): 1161–1201.
Beshears, John, James J. Choi, David Laibson, Brigitte C. Madrian, and Sean (Yixiang) Wang. 2015b. "Who Is Easier to Nudge?" Working Paper.
Bettinger, Eric P, Bridget Terry Long, Philip Oreopoulos, and Lisa Sanbonmatsu. 2012. "The role of application assistance and information in college decisions: Results from the H&R Block FAFSA experiment." The Quarterly Journal of Economics, 127(3): 1205–1242.
Bhargava, Saurabh, and Dayanand Manoli. 2013. "Why are benefits left on the table? Assessing the role of information, complexity, and stigma on take-up with an IRS field experiment." American Economic Review.
Bhargava, Saurabh, and Day Manoli. 2015. "Psychological Frictions and the Incomplete Take-Up of Social Benefits: Evidence from an IRS Field Experiment." American Economic Review, 105(11): 1–42.
Bollinger, Bryan, Phillip Leslie, and Alan Sorensen. 2011. "Calorie Posting in Chain Restaurants." American Economic Journal: Economic Policy, 3(1): 91–128.
Cai, Hongbin, Yuyu Chen, and Hanming Fang. 2009. "Observational Learning: Evidence from a Randomized Natural Field Experiment." The American Economic Review, 99(3): 864.
Chambers, Christopher P, and Paul J Healy. 2012. "Updating toward the signal." Economic Theory, 50(3): 765–786.
Chen, Yan, F Maxwell Harper, Joseph Konstan, and Sherry Xin Li. 2010. "Social comparisons and contributions to online communities: A field experiment on movielens." The American Economic Review, 1358–1398.
Cialdini, Robert B, Linda J Demaine, Brad J Sagarin, Daniel W Barrett, Kelton Rhoads, and Patricia L Winter. 2006. "Managing social norms for persuasive impact." Social Influence, 1(1): 3–15.
Cialdini, Robert B, Raymond R Reno, and Carl A Kallgren. 1990. "A focus theory of normative conduct: recycling the concept of norms to reduce littering in public places." Journal of Personality and Social Psychology, 58(6): 1015.
Clark, Robert L, Jennifer A Maki, and Melinda Sandler Morrill. 2014. "Can Simple Informational Nudges Increase Employee Participation in a 401 (k) Plan?" Southern Economic Journal, 80(3): 677–701.
Coffman, Lucas C, Clayton R Featherstone, and Judd B Kessler. 2014. "Can Social Information Affect What Job You Choose and Keep?" Working Paper.
Cornand, Camille, and Frank Heinemann. 2008. "Optimal degree of public information dissemination." The Economic Journal, 118(528): 718–742.
Croson, Rachel, and Jen Yue Shang. 2008. "The impact of downward social information on contribution decisions." Experimental Economics, 11(3): 221–233.
Del Carpio, Lucia. 2013. "Are the neighbors cheating? Evidence from a social norm experiment on property taxes in Peru." Working Paper, Princeton University, Princeton, NJ.
Denuit, Michel M, and Louis Eeckhoudt. 2010. "A general index of absolute risk attitude." Management Science, 56(4): 712–715.
Featherstone, Clayton R. 2015. "Prior-Free Bayesian Updating."
Fellner, Gerlinde, Rupert Sausgruber, and Christian Traxler. 2013. "Testing enforcement strategies in the field: Threat, moral appeal and social information." Journal of the European Economic Association, 11(3): 634–660.
Fischbacher, Urs, Simon Gächter, and Ernst Fehr. 2001. "Are people conditionally cooperative? Evidence from a public goods experiment." Economics Letters, 71(3): 397–404.
Frey, Bruno S, and Stephan Meier. 2004. "Social comparisons and pro-social behavior: Testing 'conditional cooperation' in a field experiment." American Economic Review, 1717–1722.
Gerber, Alan S, and Todd Rogers. 2009. "Descriptive social norms and motivation to vote: Everybody's voting and so should you." The Journal of Politics, 71(01): 178–191.
Goldstein, Noah J, Robert B Cialdini, and Vladas Griskevicius. 2008. "A room with a viewpoint: Using social norms to motivate environmental conservation in hotels." Journal of Consumer Research, 35(3): 472–482.
Hallsworth, Michael, John List, Robert Metcalfe, and Ivo Vlaev. 2014. "The behavioralist as tax collector: Using natural field experiments to enhance tax compliance." National Bureau of Economic Research.
Hastings, Justine, Christopher A Neilson, and Seth D Zimmerman. 2015. "The effects of earnings disclosure on college enrollment decisions." National Bureau of Economic Research.
Hastings, Justine S, and Jeffrey M Weinstein. 2007. "Information, school choice, and academic achievement: Evidence from two experiments." National Bureau of Economic Research.
Ibragimov, Il'dar Abdullovich. 1956. "On the composition of unimodal distributions." Theory of Probability & Its Applications, 1(2): 255–260.
Jensen, Robert. 2010. "The (perceived) returns to education and the demand for schooling." The Quarterly Journal of Economics, 125(2): 515–548.
Jessoe, Katrina, and David Rapson. 2014. "Knowledge Is (Less) Power: Experimental Evidence from Residential Energy Use." American Economic Review, 104(4): 1417–38.
Kamenica, Emir, and Matthew Gentzkow. 2011. "Bayesian Persuasion." American Economic Review, 101(6): 2590–2615.
Keser, Claudia, and Frans Van Winden. 2000. "Conditional cooperation and voluntary contributions to public goods." The Scandinavian Journal of Economics, 102(1): 23–39.
Kessler, Judd B, and Alvin E Roth. 2015. "Organ Allocation Policy and the Decision to Donate." American Economic Review.
Kimball, Miles. 1992. "Precautionary Motives for Holding Assets." In The New Palgrave Dictionary of Money and Finance, ed. Peter Newman, Murray Milgate and John Eatwell, 158–161. New York: Stockton Press.
Kimball, Miles S. 1990. "Precautionary Saving in the Small and in the Large." Econometrica: Journal of the Econometric Society, 53–73.
Marshall, Albert W, Ingram Olkin, and Barry Arnold. 2010. Inequalities: theory of majorization and its applications. Springer Science & Business Media.
Martin, Richard, and John Randal. 2008. "How is donation behaviour affected by the donations of others?" Journal of Economic Behavior & Organization, 67(1): 228–238.
Milgrom, Paul R. 1981. "Good news and bad news: Representation theorems and applications." The Bell Journal of Economics, 380–391.
Morris, Stephen, and Hyun Song Shin. 2002. "Social value of public information." The American Economic Review, 92(5): 1521–1534.
Nelsen, Roger B. 2013. An introduction to copulas. Vol. 139, Springer Science & Business Media.
Nguyen, Trang. 2008. "Information, Role Models and Perceived Returns to Education: Experimental Evidence from Madagascar." M.I.T. Working Paper.
Pomeranz, Dina. 2015. "Taxation without information." American Economic Review, 105(8): 2539–69.
Potters, Jan, Martin Sefton, and Lise Vesterlund. 2005. "After you – endogenous sequencing in voluntary contribution games." Journal of Public Economics, 89(8): 1399–1419.
Potters, Jan, Martin Sefton, and Lise Vesterlund. 2007. "Leading-by-example and signaling in voluntary contribution games: an experimental study." Economic Theory, 33(1): 169–182.
Prékopa, András. 1973. "Logarithmic concave measures and functions." Acta Scientiarum Mathematicarum, 34(1): 334–343.
Rohatgi, Vijay K, and Gábor J Székely. 1989. "Sharp inequalities between skewness and kurtosis." Statistics & Probability Letters, 8(4): 297–299.
Salganik, Matthew J, Peter Sheridan Dodds, and Duncan J Watts. 2006. "Experimental study of inequality and unpredictability in an artificial cultural market." Science, 311(5762): 854–856.
Sancetta, Alessio, and Stephen Satchell. 2004. "The Bernstein copula and its applications to modeling and approximations of multivariate distributions." Econometric Theory, 20(03): 535–562.
Schultz, P Wesley, Jessica M Nolan, Robert B Cialdini, Noah J Goldstein, and Vladas Griskevicius. 2007. "The constructive, destructive, and reconstructive power of social norms." Psychological Science, 18(5): 429–434.
Shang, Jen, and Rachel Croson. 2009. "A field experiment in charitable contribution: The impact of social information on the voluntary provision of public goods." The Economic Journal, 119(540): 1422–1439.
Slemrod, Joel, Marsha Blumenthal, and Charles Christian. 2001. "Taxpayer response to an increased probability of audit: evidence from a controlled experiment in Minnesota." Journal of Public Economics, 79(3): 455–483.
Sunstein, Cass R., and Richard H. Thaler. 2008. Nudge: Improving Decisions About Health, Wealth, and Happiness. Penguin Books.
Taubinsky, Dmitry. 2014. "From intentions to actions: A model and experimental evidence of inattentive choice." Working Paper.
Vesterlund, Lise. 2003. "The informational value of sequential fundraising." Journal of Public Economics, 87(3): 627–657.
Wiswall, Matthew, and Basit Zafar. 2015. "Determinants of College Major Choice: Identification using an Information Experiment." Review of Economic Studies, 82(4): 791–824.
Mathematical appendix
A Proofs from Section 2
A.1 Conditional independence (Section 2.1)
Proof of Proposition 1. For a signal where ν possibly depends on both x and y, the posterior
over both x and y is
\[ f(x, y \mid \nu) = \frac{f(x, y) \cdot n(\nu \mid x, y)}{\int\!\!\int f(x, y) \cdot n(\nu \mid x, y) \, dx \, dy}. \]
Hence, the posterior marginal over x is
\[ f(x \mid \nu) = \frac{\int f(x, y) \cdot n(\nu \mid x, y) \, dy}{\int\!\!\int f(x, y) \cdot n(\nu \mid x, y) \, dx \, dy}. \]
Now, ν ⊥ y | x tells us that the signal distribution doesn't depend on y, that is, n(ν | x, y) =
n(ν | x, y′) for all y, y′. Letting n(ν | x) denote this value then, we can simplify our expression
for the posterior marginal over x to
\[ f(x \mid \nu) = \frac{n(\nu \mid x) \cdot \int f(x, y) \, dy}{\int n(\nu \mid x) \cdot \int f(x, y) \, dy \, dx}. \]
Since the prior marginal over x is f(x) = ∫ f(x, y) dy, we have shown the first part of the
proposition. The second part follows directly from the definition of conditional independence.
A.2 Information nudge approximation (Section 2.2)
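Lemma 1. The posterior in the nudge approximation is
\[ f(x \mid \nu) = \left\{ 1 + \varepsilon^2 \cdot \left[ (x - \mu) \cdot (\nu - \mu) - \frac{1}{2} \cdot \left[ (x - \mu)^2 - \sigma^2 \right] \right] \right\} \cdot f(x). \]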
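Proof. To start, we need to use Bayes' rule to compute the posterior under the information
nudge approximation:
\[ f(x \mid \nu) = \frac{f(x) \cdot n(\nu \mid x)}{\int f(x) \cdot n(\nu \mid x) \, dx}. \]
The denominator for Bayes' rule is then
\[
\begin{aligned}
\int n(\nu \mid x) \cdot f(x) \, dx &\propto \int \left\{ 1 - \frac{1}{2} \cdot \varepsilon^2 \cdot (x - \nu)^2 \right\} \cdot f(x) \, dx, \\
&\propto 1 - \frac{1}{2} \cdot \varepsilon^2 \cdot \int \left[ (x - \mu) + (\mu - \nu) \right]^2 \cdot f(x) \, dx, \\
&\propto 1 - \frac{1}{2} \cdot \varepsilon^2 \cdot \left[ \sigma^2 + (\nu - \mu)^2 \right],
\end{aligned}
\]
where we move from the second line to the third by noting that the 2 · (x − µ) · (µ − ν) term
disappears since µ is defined as the prior expectation of x. Hence, Bayes' rule gives our
posterior as
\[
\begin{aligned}
f(x \mid \nu) &= \frac{1 - \frac{1}{2} \cdot \varepsilon^2 \cdot (x - \nu)^2}{1 - \frac{1}{2} \cdot \varepsilon^2 \cdot \left[ \sigma^2 + (\nu - \mu)^2 \right]} \cdot f(x), \\
&\approx \left\{ 1 + \frac{1}{2} \cdot \varepsilon^2 \cdot \left[ \sigma^2 + (\nu - \mu)^2 - (x - \nu)^2 \right] \right\} \cdot f(x), \\
&= \left\{ 1 + \frac{1}{2} \cdot \varepsilon^2 \cdot \left[ \sigma^2 + (\nu - \mu)^2 - \left[ (x - \mu) + (\mu - \nu) \right]^2 \right] \right\} \cdot f(x), \\
&= \left\{ 1 + \varepsilon^2 \cdot \left[ (x - \mu) \cdot (\nu - \mu) - \frac{1}{2} \cdot \left[ (x - \mu)^2 - \sigma^2 \right] \right] \right\} \cdot f(x),
\end{aligned}
\]
where the second line uses the approximation 1/(1 + ε²) ≈ 1 − ε² and ignores terms that are o(ε²),
and the third and fourth put (x − ν)² in a form where x is directly compared to µ instead
of ν.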
Going forward, it will be useful to denote the central moments of the prior by µ_n ≡ E[(x − µ)^n].²⁹
Now we proceed by computing the posterior expectation.
Lemma 2. The first moment of the posterior, centered about µ, is
\[ E[x \mid \nu] = \mu + \left[ \sigma^2 \cdot (\nu - \mu) - \frac{1}{2} \cdot \mu_3 \right] \cdot \varepsilon^2. \]
Further, all posterior moments centered at µ can be expressed as
\[ E[\psi(x) \mid \nu] = E[\psi(x)] + O(\varepsilon^2). \]

29 Note that this means µ_0 = 1, µ_1 = 0, and µ_2 = σ².
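Proof. For the first result, simple computation with the formula from Lemma 1 yields
\[
\begin{aligned}
E[x \mid \nu] &= \int \left\{ 1 + \varepsilon^2 \cdot \left[ (x - \mu) \cdot (\nu - \mu) - \frac{1}{2} \cdot \left[ (x - \mu)^2 - \sigma^2 \right] \right] \right\} \cdot x \cdot f(x) \, dx \\
&= \mu + \int \left\{ (x - \mu) + \varepsilon^2 \cdot \left[ (x - \mu)^2 \cdot (\nu - \mu) - \frac{1}{2} \cdot \left[ (x - \mu)^3 - \sigma^2 \cdot (x - \mu) \right] \right] \right\} \cdot f(x) \, dx \\
&= \mu + \varepsilon^2 \cdot \left[ \sigma^2 \cdot (\nu - \mu) - \frac{1}{2} \cdot \mu_3 \right].
\end{aligned}
\]
The second follows directly from the fact that the posterior distribution is of form f(x) +
O(ε²).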
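Lemma 2 can be sanity-checked numerically. The sketch below uses a hypothetical three-point prior and the nudge likelihood n(ν | x) ∝ 1 − ½ · ε² · (x − ν)² from the proof of Lemma 1, then compares the exact Bayes posterior mean with the lemma's leading-order expression; the two should differ only by terms that are small relative to ε².

```python
def posterior_mean_exact(xs, ps, nu, eps2):
    """Bayes posterior mean under the likelihood 1 - 0.5 * eps2 * (x - nu)**2."""
    weights = [p * (1 - 0.5 * eps2 * (x - nu) ** 2) for x, p in zip(xs, ps)]
    return sum(w * x for w, x in zip(weights, xs)) / sum(weights)


def posterior_mean_lemma2(xs, ps, nu, eps2):
    """Leading-order expression: mu + [sigma^2 * (nu - mu) - mu_3 / 2] * eps2."""
    mu = sum(p * x for x, p in zip(xs, ps))
    sigma2 = sum(p * (x - mu) ** 2 for x, p in zip(xs, ps))
    mu3 = sum(p * (x - mu) ** 3 for x, p in zip(xs, ps))
    return mu + (sigma2 * (nu - mu) - 0.5 * mu3) * eps2


# Hypothetical skewed prior on three points
xs, ps = [0.0, 1.0, 3.0], [0.5, 0.3, 0.2]
nu, eps2 = 2.0, 1e-3

exact = posterior_mean_exact(xs, ps, nu, eps2)
approx = posterior_mean_lemma2(xs, ps, nu, eps2)
print(exact, approx, abs(exact - approx))
```

Note that the skewness term −½ · µ_3 · ε² pulls the posterior mean down for this right-skewed prior, partly offsetting the upward pull from a signal above µ.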
Going forward, it will be easier to compute centralized moments of the posterior taken
around µ, but we will ultimately be more interested in central moments, taken around
E [x | ν]. Fortunately, these are related in a simple way.
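Lemma 3. Moments of the posterior centered about µ and E[x | ν] are related by the following
expression:
\[ E\left[ (x - E[x \mid \nu])^n \mid \nu \right] = E\left[ (x - \mu)^n \mid \nu \right] - n \cdot \mu_{n-1} \cdot \left[ \sigma^2 \cdot (\nu - \mu) - \frac{1}{2} \cdot \mu_3 \right] \cdot \varepsilon^2 + o(\varepsilon^2). \]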
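Proof. If we express $E\left[ (x - E[x \mid \nu])^n \mid \nu \right]$ as $E\left[ \left( (x-\mu) - \left[ \sigma^2 (\nu-\mu) - \tfrac{1}{2}\,\mu_3 \right] \varepsilon^2 \right)^n \,\middle|\, \nu \right]$, all but the first two terms of the binomial expansion will be $o(\varepsilon^2)$. Then,
\[
\begin{aligned}
E\left[ (x - E[x \mid \nu])^n \mid \nu \right]
&= E\left[ (x-\mu)^n \mid \nu \right] - n\, E\left[ (x-\mu)^{n-1} \mid \nu \right] \left[ \sigma^2 (\nu-\mu) - \tfrac{1}{2}\,\mu_3 \right] \varepsilon^2 + o(\varepsilon^2), \\
&= E\left[ (x-\mu)^n \mid \nu \right] - n\, \mu_{n-1} \left[ \sigma^2 (\nu-\mu) - \tfrac{1}{2}\,\mu_3 \right] \varepsilon^2 + o(\varepsilon^2),
\end{aligned}
\]
where the second line came from the second part of Lemma 2.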
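Now, we can derive an expression for the central moments of the posterior.
Theorem 11. The posterior central moments (that is, centered around $E[x \mid \nu]$) are:
\[
E\left[ (x - E[x \mid \nu])^n \mid \nu \right]
= \mu_n + \left\{ \left( \mu_{n+1} - n\, \mu_{n-1}\, \sigma^2 \right) (\nu-\mu) - \tfrac{1}{2} \left[ \mu_{n+2} - \sigma^2 \mu_n - n\, \mu_3\, \mu_{n-1} \right] \right\} \varepsilon^2 + o(\varepsilon^2).
\]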
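Proof. Looking to the formula for the posterior distribution, an arbitrary moment centered at $\mu$ is given by
\[
E\left[ (x-\mu)^n \mid \nu \right] = \mu_n + \varepsilon^2 \left\{ \mu_{n+1} (\nu-\mu) - \tfrac{1}{2} \left[ \mu_{n+2} - \sigma^2 \mu_n \right] \right\}.
\]
Plugging this into the result from Lemma 3 then yields
\[
\begin{aligned}
E\left[ (x - E[x \mid \nu])^n \mid \nu \right]
&= \mu_n + \left\{ \mu_{n+1} (\nu-\mu) - \tfrac{1}{2} \left[ \mu_{n+2} - \sigma^2 \mu_n \right] - n\, \mu_{n-1} \left[ \sigma^2 (\nu-\mu) - \tfrac{1}{2}\,\mu_3 \right] \right\} \varepsilon^2 + o(\varepsilon^2) \\
&= \mu_n + \left\{ \left( \mu_{n+1} - n\, \mu_{n-1}\, \sigma^2 \right) (\nu-\mu) - \tfrac{1}{2} \left[ \mu_{n+2} - \sigma^2 \mu_n - n\, \mu_3\, \mu_{n-1} \right] \right\} \varepsilon^2 + o(\varepsilon^2).
\end{aligned}
\]
To further simplify this expression under the near-symmetric approximation, simply set all odd moments equal to zero. To simplify further with the near-normal approximation, note that for a normal distribution, $\mu_{n+2} = (n+1)!!\, \sigma^{n+2}$ for all $n \geq 1$.$^{30}$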
Finally, we show that Theorem 1 in the main text is a corollary of what we just proved.
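Proof of Theorem 1. Lemma 2 and the identity $\gamma_1 \equiv \frac{\mu_3}{\sigma^3}$ yield the expression for the posterior expectation. Plugging $n = 2$ into the expression from Theorem 11 yields
\[
\begin{aligned}
\mathrm{Var}(x \mid \nu)
&= \mu_2 + \left\{ \mu_3 (\nu-\mu) - \tfrac{1}{2} \left[ \mu_4 - \sigma^2 \mu_2 \right] \right\} \varepsilon^2 + o(\varepsilon^2), \\
&= \sigma^2 + \left\{ \sigma^3\, \frac{\mu_3}{\sigma^3}\, (\nu-\mu) - \tfrac{1}{2} \left[ \sigma^4 \left( \frac{\mu_4}{\sigma^4} - 3 + 3 \right) - \sigma^4 \right] \right\} \varepsilon^2 + o(\varepsilon^2), \\
&= \sigma^2 + \left\{ \sigma^3 \gamma_1 (\nu-\mu) - \sigma^4 \left( 1 + \tfrac{1}{2}\,\gamma_2 \right) \right\} \varepsilon^2 + o(\varepsilon^2),
\end{aligned}
\]
where the third line comes from remembering that $\gamma_2 \equiv \frac{\mu_4}{\sigma^4} - 3$. The results when $\gamma_1$ and $\gamma_2$ are small follow trivially.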
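As a rough numerical companion to the variance expression just derived, the sketch below compares $\sigma^2 + \{\sigma^3 \gamma_1 (\nu-\mu) - \sigma^4 (1 + \tfrac{1}{2}\gamma_2)\}\varepsilon^2$ with an exactly computed posterior variance. The discrete prior, the signal value, and the likelihood $\exp\{-\tfrac{1}{2}\varepsilon^2 (x-\nu)^2\}$ are assumptions made for illustration; only the approximation formula comes from the proof.

```python
import math

def exact_posterior_variance(xs, ps, nu, eps):
    """Posterior variance under the ASSUMED likelihood exp(-eps^2 (x - nu)^2 / 2)."""
    weights = [p * math.exp(-0.5 * eps**2 * (x - nu) ** 2) for x, p in zip(xs, ps)]
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, xs)) / total
    return sum(w * (x - mean) ** 2 for w, x in zip(weights, xs)) / total

def theorem1_posterior_variance(xs, ps, nu, eps):
    """sigma^2 + {sigma^3 g1 (nu - mu) - sigma^4 (1 + g2 / 2)} eps^2."""
    mu = sum(p * x for x, p in zip(xs, ps))
    m2, m3, m4 = (sum(p * (x - mu) ** k for x, p in zip(xs, ps)) for k in (2, 3, 4))
    sigma = math.sqrt(m2)
    gamma1 = m3 / sigma**3          # skewness
    gamma2 = m4 / sigma**4 - 3.0    # excess kurtosis
    shift = sigma**3 * gamma1 * (nu - mu) - sigma**4 * (1.0 + 0.5 * gamma2)
    return sigma**2 + shift * eps**2

# Hypothetical prior with mu = 1, sigma = 1, gamma1 = 0.6, gamma2 = -0.8.
XS = [0.0, 1.0, 2.0, 3.0]
PS = [0.4, 0.3, 0.2, 0.1]
NU = 3.0

for eps in (0.05, 0.025):
    gap = abs(exact_posterior_variance(XS, PS, NU, eps)
              - theorem1_posterior_variance(XS, PS, NU, eps))
    print(f"eps={eps}: |exact - Theorem 1| = {gap:.2e}")
```

With this prior the bracketed coefficient is $0.6$, so the signal raises the posterior variance slightly: the skewness term outweighs the usual variance reduction.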
$^{30}$The double factorial is defined such that, for even $n$, $(n+1)!! = \prod_{i=0}^{n/2} (2i+1)$. So, for instance, $7!! = 7 \cdot 5 \cdot 3 \cdot 1 = 105$.
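For readers who want to check the double-factorial rule for normal moments, the following sketch computes $(n+1)!!\,\sigma^{n+2}$ and compares it with a trapezoid-rule estimate of the corresponding even central moment. The helper names, the integration grid, and the value $\sigma = 1.5$ are hypothetical choices for the example.

```python
import math

def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 or 2."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def normal_central_moment(order, sigma, half_width=12.0, steps=24000):
    """Trapezoid estimate of E[(x - mu)^order] for a normal with std `sigma`."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        z = -half_width + i * h                     # standardized grid point
        weight = 0.5 if i in (0, steps) else 1.0    # trapezoid end weights
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += weight * (sigma * z) ** order * density
    return total * h

SIGMA = 1.5
for n in (0, 2, 4, 6):
    closed_form = double_factorial(n + 1) * SIGMA ** (n + 2)
    numeric = normal_central_moment(n + 2, SIGMA)
    print(f"n={n}: closed form {closed_form:.6f}, numeric {numeric:.6f}")
```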
A.3 How the population responds to an information nudge (Section 2.5)
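Proof of Theorem 2. Taking the expectation of the product mentioned in the text yields
\[
\begin{aligned}
\tau &= E\left[ t(\mu)\, \sigma^2 (\nu-\mu)\, \varepsilon^2 \right], \\
&= \varepsilon^2\, E[\sigma^2] \int t(\mu)\, (\nu-\mu)\, m(\mu)\, d\mu, \\
&= \varepsilon^2\, E[\sigma^2] \int t(\mu)\, m(\mu)\, d\mu \cdot \int (\nu-\mu)\, \frac{t(\mu)\, m(\mu)}{\int t(\mu)\, m(\mu)\, d\mu}\, d\mu, \\
&= \varepsilon^2\, E[\sigma^2]\, E[t(\mu)]\, \left( \nu - E[\mu \mid \theta = \mu] \right), \\
&= \varepsilon^2\, E[\sigma^2]\, E[m(\theta)]\, \left( \nu - E[\mu \mid \theta = \mu] \right),
\end{aligned}
\]
where the last two lines establish the result.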
B Proofs from Section 3
B.1 Threshold and belief distribution shifters (Section 3.2)
Two lemmas will prove useful.
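Lemma 4. If $t(\theta \mid \lambda_t)$ and $m(\mu \mid \lambda_m)$ are ordered in the MLR sense, then the distribution of prior expectations among marginal agents,
\[
g(\mu \mid \theta = \mu;\, \lambda_t, \lambda_m) \equiv \frac{t(\mu \mid \lambda_t)\, m(\mu \mid \lambda_m)}{\int t(\mu \mid \lambda_t)\, m(\mu \mid \lambda_m)\, d\mu},
\]
is ordered in the MLR sense in both $(\mu, \lambda_m)$ and $(\mu, \lambda_t)$.
Proof. The likelihood ratio with respect to $(\mu, \lambda_t)$, $\frac{g(\mu \mid \theta = \mu;\, \lambda_t', \lambda_m)}{g(\mu \mid \theta = \mu;\, \lambda_t, \lambda_m)}$, is given by:
\[
\frac{g(\mu \mid \theta = \mu;\, \lambda_t', \lambda_m)}{g(\mu \mid \theta = \mu;\, \lambda_t, \lambda_m)}
= \frac{\dfrac{t(\mu \mid \lambda_t')\, m(\mu \mid \lambda_m)}{\int t(\mu \mid \lambda_t')\, m(\mu \mid \lambda_m)\, d\mu}}{\dfrac{t(\mu \mid \lambda_t)\, m(\mu \mid \lambda_m)}{\int t(\mu \mid \lambda_t)\, m(\mu \mid \lambda_m)\, d\mu}}
= \frac{t(\mu \mid \lambda_t')}{t(\mu \mid \lambda_t)} \cdot \frac{\int t(\mu \mid \lambda_t)\, m(\mu \mid \lambda_m)\, d\mu}{\int t(\mu \mid \lambda_t')\, m(\mu \mid \lambda_m)\, d\mu}.
\]
Since the ratio of integrals does not vary with $\mu$, we have shown our result with respect to $(\mu, \lambda_t)$. Similar logic shows it with respect to $(\mu, \lambda_m)$.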
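Lemma 5. For all $x$,
• $\lim_{\lambda_t \to -\infty} G(x \mid \theta = \mu;\, \lambda_t, \lambda_m) = \lim_{\lambda_m \to -\infty} G(x \mid \theta = \mu;\, \lambda_t, \lambda_m) = 1$, and
• $\lim_{\lambda_t \to \infty} G(x \mid \theta = \mu;\, \lambda_t, \lambda_m) = \lim_{\lambda_m \to \infty} G(x \mid \theta = \mu;\, \lambda_t, \lambda_m) = 0$.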
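Proof. Writing out the definition of the conditional distribution functions, we get
\[
\begin{aligned}
G(x \mid \theta = \mu;\, \lambda_t, \lambda_m)
&= \frac{\int_{-\infty}^{x} t(\mu \mid \lambda_t)\, m(\mu \mid \lambda_m)\, d\mu}{\int t(\mu \mid \lambda_t)\, m(\mu \mid \lambda_m)\, d\mu} \\
&= \frac{T(x \mid \lambda_t) \int_{-\infty}^{x} t(\mu \mid \mu \leq x;\, \lambda_t)\, m(\mu \mid \lambda_m)\, d\mu}{T(x \mid \lambda_t) \int_{-\infty}^{x} t(\mu \mid \mu \leq x;\, \lambda_t)\, m(\mu \mid \lambda_m)\, d\mu + \left[ 1 - T(x \mid \lambda_t) \right] \int_{x}^{\infty} t(\mu \mid \mu \geq x;\, \lambda_t)\, m(\mu \mid \lambda_m)\, d\mu}.
\end{aligned}
\]
Clearly, as $\lambda_t \to \infty$, this is zero, and when $\lambda_t \to -\infty$, it is one, since the integrals are all finite for any fixed value of $x$. Similar logic works for $\lambda_m \to \pm\infty$.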
Now, we can prove the assertion made in Footnote 18 of the main text.
Lemma 6. Given the assumptions of Section 3.2, the following are true:
• $E[\mu \mid \theta = \mu;\, \lambda_t, \lambda_m]$ is increasing in $\lambda_t$, and its range for any fixed $\lambda_m$ is $(-\infty, \infty)$.
• $E[\mu \mid \theta = \mu;\, \lambda_t, \lambda_m]$ is increasing in $\lambda_m$, and its range for any fixed $\lambda_t$ is $(-\infty, \infty)$.
Proof of Lemma 6. Lemma 4 immediately implies the first "half" of both parts, since $\mu$ is an increasing function. For the second "half" of Part 1, note that $E[\mu \mid \theta = \mu;\, \lambda_t, \lambda_m]$ can be written as an integral of the conditional distribution function, that is,
\[
E[\mu \mid \theta = \mu;\, \lambda_t, \lambda_m] = \int_{0}^{\infty} \left[ 1 - G(x \mid \theta = \mu;\, \lambda_t, \lambda_m) \right] dx - \int_{-\infty}^{0} G(x \mid \theta = \mu;\, \lambda_t, \lambda_m)\, dx.
\]
Given this, the result on the range of $E[\mu \mid \theta = \mu;\, \lambda_t, \lambda_m]$ with respect to either $\lambda_m$ or $\lambda_t$ is a straightforward implication of Lemma 5.
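The representation of an expectation as an integral of its distribution function, used in the proof above, can be illustrated numerically. In the sketch below the distribution $G$ is replaced by a hypothetical normal distribution with mean $0.7$ and standard deviation $1.3$, and the integration window and step count are arbitrary choices for the example.

```python
from statistics import NormalDist

def trapezoid(func, lo, hi, steps):
    """Composite trapezoid rule for func on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (func(lo) + func(hi))
    for i in range(1, steps):
        total += func(lo + i * h)
    return total * h

def mean_from_cdf(cdf, bound=15.0, steps=15000):
    """E[X] = int_0^inf [1 - G(x)] dx - int_-inf^0 G(x) dx, truncated at +/- bound."""
    upper = trapezoid(lambda x: 1.0 - cdf(x), 0.0, bound, steps)
    lower = trapezoid(cdf, -bound, 0.0, steps)
    return upper - lower

# Hypothetical stand-in for G: a normal distribution with mean 0.7.
EXAMPLE = NormalDist(mu=0.7, sigma=1.3)
recovered_mean = mean_from_cdf(EXAMPLE.cdf)
print(f"mean recovered from the CDF: {recovered_mean:.6f}")
```

Because the mean is an integral of $G$, pushing $G(x)$ toward one or zero for every $x$, as in Lemma 5, pushes the mean toward $-\infty$ or $+\infty$.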
B.2 Shifter comparative statics (Sections 3.2.1 - 3.2.2)
Before our next set of claims, we prove two useful lemmas.
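Lemma 7. The following are true:
1. If $\lim_{\mu \to \pm\infty} \psi(\mu)$ exists, then $\lim_{\lambda_t \to \pm\infty} \int t(\mu \mid \lambda_t)\, \psi(\mu)\, d\mu = \lim_{\mu \to \pm\infty} \psi(\mu)$.
2. If $\lim_{\mu \to \pm\infty} \psi(\mu)$ exists, then $\lim_{\lambda_m \to \pm\infty} \int m(\mu \mid \lambda_m)\, \psi(\mu)\, d\mu = \lim_{\mu \to \pm\infty} \psi(\mu)$.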
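Proof. For the "$-\infty$" piece of Part 1, note that for any value of $x$, our integral can also be expressed as
\[
T(x \mid \lambda_t) \int_{-\infty}^{x} t(\mu \mid \mu \leq x;\, \lambda_t)\, \psi(\mu)\, d\mu + \left[ 1 - T(x \mid \lambda_t) \right] \int_{x}^{\infty} t(\mu \mid \mu \geq x;\, \lambda_t)\, \psi(\mu)\, d\mu.
\]
Clearly then, for any $x$, the following bounds must hold:
\[
\begin{aligned}
T(x \mid \lambda_t) \min_{\mu \leq x} \psi(\mu) + \left[ 1 - T(x \mid \lambda_t) \right] \min_{\mu \geq x} \psi(\mu)
&\leq \int t(\mu \mid \lambda_t)\, \psi(\mu)\, d\mu \\
&\leq T(x \mid \lambda_t) \max_{\mu \leq x} \psi(\mu) + \left[ 1 - T(x \mid \lambda_t) \right] \max_{\mu \geq x} \psi(\mu).
\end{aligned}
\]
Taking the limit of these bounds as $\lambda_t \to -\infty$, we get $\min_{\mu \leq x} \psi(\mu) \leq \lim_{\lambda_t \to -\infty} \int t(\mu \mid \lambda_t)\, \psi(\mu)\, d\mu \leq \max_{\mu \leq x} \psi(\mu)$. If we then take the limit as $x \to -\infty$, we get $\lim_{\lambda_t \to -\infty} \int t(\mu \mid \lambda_t)\, \psi(\mu)\, d\mu = \lim_{\mu \to -\infty} \psi(\mu)$. The same logic with $\lambda_t, x \to \infty$ yields the "$+\infty$" piece of Part 1. Part 2 follows by identical logic.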
Lemma 8. If $f(x)$ is log-concave on some domain, then on that domain, it can only have one maximum.
Proof. Log-concavity implies that $\frac{\partial^2 \log f}{\partial x^2} < 0$ and $f > 0$. Together, these imply that $f'' < \frac{f'^2}{f}$. At an optimum, $f' = 0$, which combined with $f'' < \frac{f'^2}{f}$ means that $f'' < 0$, the second-order condition for a local maximum. And if all optima are maxima, then there can only be one on the range.
Proof of Proposition 2. For Part 1, it will be easiest to deal with the treatment effect in the form $\tau = \varepsilon^2\, E[\sigma^2] \int t(\mu \mid \lambda_t)\, m(\mu)\, (\nu-\mu)\, d\mu$: simply apply Lemma 7 with $\psi(\mu) = m(\mu)\, (\nu-\mu)$. Since $\int m(\mu)\, d\mu = 1$ implies that the tail of $m(\mu)$ is $o(1/|\mu|)$, we know that $\lim_{\mu \to \pm\infty} m(\mu)\, (\nu-\mu) = 0$. It then follows that $\lim_{\lambda_t \to \pm\infty} \int t(\mu \mid \lambda_t)\, m(\mu)\, (\nu-\mu)\, d\mu = 0$, and hence $\lim_{\lambda_t \to \pm\infty} \tau(\lambda_t) = 0$. Part 2 is obvious given the form of $\tau$ in Theorem 2: the sign of $\tau$ is the same as $\nu - \theta$. To show Part 3, again think about the form of $\tau$ in Theorem 2. Recall that $E[t(\mu \mid \lambda_t)] \equiv \int t(\mu \mid \lambda_t)\, m(\mu)\, d\mu$. Clearly, the product of log-concave functions is log-concave, so the integrand is log-concave in $(\mu, \lambda_t)$. Prékopa (1973) then tells us that the marginals of log-concave functions are log-concave with respect to the remaining variables; hence, $E[t(\mu \mid \lambda_t)]$ is log-concave in $\lambda_t$. Now, since $\nu - \lambda_t$ is log-concave whenever $\lambda_t < \nu$,$^{31}$ we have established that $\tau(\lambda_t)$ is log-concave on that domain. Since $\lim_{\lambda_t \to -\infty} \tau(\lambda_t) = \tau(\nu) = 0$, there must be a local maximum on $\lambda_t < \nu$. By Lemma 8, this is the unique maximum. Similar logic applied to $-\tau(\lambda_t)$ on the domain $\lambda_t > \nu$ gives us a unique minimum on that domain.
The proof of Proposition 3 uses identical logic.
$^{31}$To see this, note that $\log(\nu - \lambda_t)$ only exists when $\lambda_t < \nu$, and when it does, $\frac{\partial^2}{\partial \lambda_t^2} \log(\nu - \lambda_t) = -\frac{1}{(\nu - \lambda_t)^2} < 0$.
B.3 Proxying for shifters with matriculation (Section 3.3)
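Lemma 9. The following results are true:
• $\beta(\lambda_t, \lambda_m)$ is increasing in $\lambda_m$ and decreasing in $\lambda_t$.
• If the domain of $(\lambda_t, \lambda_m)$ is $(-\infty, \infty) \times (-\infty, \infty)$, then the range of $\beta(\lambda_t, \lambda_m)$ is $[0, 1]$.
Proof of Lemma 9. To see the first part, recall that the MLR ordering implies the first-order stochastic ordering, which is equivalent to $T_2(\mu \mid \lambda_t) < 0$ and $M_2(\mu \mid \lambda_m) < 0$. For the second part, note that the matriculation integral obeys the bounds
\[
0 \cdot \int_{-\infty}^{x} m(\mu)\, d\mu + T(x \mid \lambda_t) \int_{x}^{\infty} m(\mu)\, d\mu
\;\leq\; \beta(\lambda_t) \;\leq\;
T(x \mid \lambda_t) \int_{-\infty}^{x} m(\mu)\, d\mu + 1 \cdot \int_{x}^{\infty} m(\mu)\, d\mu.
\]
As $\lambda_t \to -\infty$, the bounds become $\int_{x}^{\infty} m(\mu)\, d\mu \leq \int T(\mu \mid \lambda_t)\, m(\mu)\, d\mu \leq \int m(\mu)\, d\mu$, which as $x \to -\infty$, implies via the squeeze theorem that $\lim_{\lambda_t \to -\infty} \beta(\lambda_t) = 1$. As $\lambda_t \to \infty$, the bounds become $0 \leq \int T(\mu \mid \lambda_t)\, m(\mu)\, d\mu \leq \int_{x}^{\infty} m(\mu)\, d\mu$, which as $x \to \infty$, implies $\lim_{\lambda_t \to \infty} \beta(\lambda_t) = 0$. Similar logic shows a similar result for the $\lambda_m$ shifter.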
Theorem 3 is a result of plugging Lemma 9 into Theorems 2 and 3.
C Normal model (Section 4.1)
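As described qualitatively in the main text, we will assume that $\mu$ and $\theta$ are distributed according to density functions
\[
\begin{aligned}
m(\mu) &= \phi\left( \mu - E[\mu] \right), \\
t(\theta \mid \lambda_t) &= \frac{1}{\eta}\, \phi\!\left( \frac{\theta - E[\theta \mid \lambda_t]}{\eta} \right),
\end{aligned}
\]
where $\phi$ is the density of the standard normal, and $E[\theta \mid \lambda_t] \equiv (1 + \eta^2)\, \lambda_t - \eta^2\, E[\mu]$. Our definition for $E[\theta \mid \lambda_t]$ is without loss of generality, and it ultimately leads to $\lambda_t = E[\mu \mid \theta = \mu;\, \lambda_t]$, as in the model in the main text (we will show this in the following proof).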
Also, note that scaling all quantities up or down doesn't change the model, so we are free
to normalize $\mathrm{Var}[\mu]$ to 1. We will derive things in terms of $\eta^2 \equiv \frac{\mathrm{Var}[\theta]}{\mathrm{Var}[\mu]}$, but the choice of normalization will come back towards the end of the following proof.
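Proof of Theorem 4. We start by calculating the matriculation effect. Since $\mu$ and $\theta$ are independent and normally distributed, their difference is distributed according to
\[
\theta - \mu \sim \frac{1}{\sqrt{1+\eta^2}}\, \phi\!\left( \frac{(\theta - \mu) - \left( E[\theta \mid \lambda_t] - E[\mu] \right)}{\sqrt{1+\eta^2}} \right),
\]
which, since agents join when $\theta - \mu \leq 0$, means that
\[
\beta(\lambda_t) = \Phi\!\left( \frac{E[\mu] - E[\theta \mid \lambda_t]}{\sqrt{1+\eta^2}} \right) = \Phi\!\left[ \sqrt{1+\eta^2}\, \left( E[\mu] - \lambda_t \right) \right].
\]
Inverting, we can express $\lambda_t$ in terms of $\beta$ as
\[
\lambda_t(\beta) = E[\mu] - \frac{1}{\sqrt{1+\eta^2}}\, \Phi^{-1}(\beta).
\]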
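The mapping between the threshold shifter $\lambda_t$ and the matriculation rate $\beta$ can be sketched in a few lines. The function names and the parameter values below are hypothetical; the formulas are the two displayed above, cross-checked against a direct computation of $\Pr(\theta - \mu \leq 0)$ with $\theta \sim N(E[\theta \mid \lambda_t], \eta^2)$ and $\mu \sim N(E[\mu], 1)$.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal: Phi = STD.cdf, Phi^{-1} = STD.inv_cdf

def matriculation_rate(lam_t, eta, mean_mu=0.0):
    """beta(lambda_t) = Phi[sqrt(1 + eta^2) * (E[mu] - lambda_t)]."""
    return STD.cdf(math.sqrt(1.0 + eta**2) * (mean_mu - lam_t))

def shifter_from_rate(beta, eta, mean_mu=0.0):
    """lambda_t(beta) = E[mu] - Phi^{-1}(beta) / sqrt(1 + eta^2)."""
    return mean_mu - STD.inv_cdf(beta) / math.sqrt(1.0 + eta**2)

def matriculation_rate_direct(lam_t, eta, mean_mu=0.0):
    """Pr(theta - mu <= 0) computed from the difference of two independent normals."""
    mean_theta = (1.0 + eta**2) * lam_t - eta**2 * mean_mu
    difference = NormalDist(mu=mean_theta - mean_mu, sigma=math.sqrt(1.0 + eta**2))
    return difference.cdf(0.0)

# Hypothetical parameter values for illustration.
LAM_T, ETA, MEAN_MU = 0.3, 0.8, 0.1
beta = matriculation_rate(LAM_T, ETA, MEAN_MU)
print(f"beta = {beta:.6f}")
print(f"lambda_t recovered from beta = {shifter_from_rate(beta, ETA, MEAN_MU):.6f}")
```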
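Now, we move on to calculate the treatment effect. To do so, we simplify the expression $t(\mu \mid \lambda_t)\, m(\mu)$ by using the following identity concerning the product of two normals:
\[
\phi\!\left( \frac{\mu - a}{b} \right) \phi\!\left( \frac{\mu - c}{d} \right)
= \phi\!\left( \frac{a - c}{\sqrt{b^2 + d^2}} \right) \phi\!\left( \frac{\mu - \frac{a\, d^2 + c\, b^2}{b^2 + d^2}}{\frac{b\, d}{\sqrt{b^2 + d^2}}} \right).
\]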
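The product-of-normals identity is easy to confirm pointwise. The sketch below evaluates both sides at a few values of $\mu$; the parameter values $a, b, c, d$ are arbitrary choices for the example.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # STD.pdf is the standard normal density phi

def product_lhs(mu, a, b, c, d):
    """phi((mu - a) / b) * phi((mu - c) / d)."""
    return STD.pdf((mu - a) / b) * STD.pdf((mu - c) / d)

def product_rhs(mu, a, b, c, d):
    """phi((a - c) / s) * phi((mu - center) / width), with s = sqrt(b^2 + d^2)."""
    s = math.sqrt(b * b + d * d)
    center = (a * d * d + c * b * b) / (b * b + d * d)
    width = b * d / s
    return STD.pdf((a - c) / s) * STD.pdf((mu - center) / width)

# Arbitrary illustrative parameters.
A, B, C, D = 0.3, 1.0, 1.2, 0.8
for mu in (-1.0, 0.0, 0.4, 2.5):
    print(f"mu={mu}: lhs={product_lhs(mu, A, B, C, D):.12f}, "
          f"rhs={product_rhs(mu, A, B, C, D):.12f}")
```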
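Applying this identity gives us
\[
\begin{aligned}
t(\mu \mid \lambda_t)\, m(\mu)
&= \frac{1}{\eta}\, \phi\!\left( \frac{E[\theta \mid \lambda_t] - E[\mu]}{\sqrt{1+\eta^2}} \right) \phi\!\left( \frac{\mu - \frac{E[\theta \mid \lambda_t] + E[\mu]\, \eta^2}{1+\eta^2}}{\frac{\eta}{\sqrt{1+\eta^2}}} \right), \\
&= \frac{1}{\sqrt{1+\eta^2}}\, \phi\!\left( \frac{\lambda_t - E[\mu]}{\frac{1}{\sqrt{1+\eta^2}}} \right) \cdot \frac{1}{\frac{\eta}{\sqrt{1+\eta^2}}}\, \phi\!\left( \frac{\mu - \lambda_t}{\frac{\eta}{\sqrt{1+\eta^2}}} \right),
\end{aligned}
\]
where the second line comes from plugging in $E[\theta \mid \lambda_t] = (1+\eta^2)\, \lambda_t - \eta^2\, E[\mu]$. Using this expression, it is easy to show that
\[
\begin{aligned}
E[t(\mu \mid \lambda_t)] &= \int t(\mu \mid \lambda_t)\, m(\mu)\, d\mu = \frac{1}{\sqrt{1+\eta^2}}\, \phi\!\left( \frac{\lambda_t - E[\mu]}{\frac{1}{\sqrt{1+\eta^2}}} \right), \\
E[\mu \mid \theta = \mu;\, \lambda_t] &= \frac{\int t(\mu \mid \lambda_t)\, m(\mu)\, \mu\, d\mu}{\int t(\mu \mid \lambda_t)\, m(\mu)\, d\mu} = \lambda_t,
\end{aligned}
\]
where the second line makes good on our promise from earlier. Hence,
\[
\begin{aligned}
\tau(\lambda_t) &= \varepsilon^2\, E[\sigma^2]\, \frac{1}{\sqrt{1+\eta^2}}\, \phi\!\left( \frac{\lambda_t - E[\mu]}{\frac{1}{\sqrt{1+\eta^2}}} \right) (\nu - \lambda_t), \\
\tau(\beta) &= \varepsilon^2\, E[\sigma^2]\, \frac{1}{\sqrt{1+\eta^2}}\, \phi\!\left( \Phi^{-1}(\beta) \right) \left( z(\nu) + \frac{1}{\sqrt{1+\eta^2}}\, \Phi^{-1}(\beta) \right),
\end{aligned}
\]
where the second line came from plugging in $\lambda_t(\beta) = E[\mu] - \frac{1}{\sqrt{1+\eta^2}}\, \Phi^{-1}(\beta)$, using the fact that $\phi(-x) = \phi(x)$, and remembering that since we normalized to make the standard deviation of the belief distribution equal to one, $\nu - E[\mu]$ equals the z-score of $\nu$ normalized against the belief distribution, which we denote as $z(\nu)$.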
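Proof of Theorem 4. From this formula, we can easily derive the $\beta_0$ from Theorem 3 to be
\[
\beta_0 = \Phi\!\left( -z(\nu)\, \sqrt{1+\eta^2} \right).
\]
Since this is the only zero of $\tau(\beta)$, for any $\beta \neq \beta_0$ the treatment effect has local optima wherever
\[
\frac{1}{\tau(\beta)} \cdot \frac{\partial \tau(\beta)}{\partial \beta}
= \left\{ -\Phi^{-1}(\beta) + \frac{1}{\sqrt{1+\eta^2}} \cdot \frac{1}{z(\nu) + \frac{1}{\sqrt{1+\eta^2}}\, \Phi^{-1}(\beta)} \right\} \cdot \frac{\partial \Phi^{-1}(\beta)}{\partial \beta} = 0,
\]
where we used the fact that $\frac{\phi'(x)}{\phi(x)} = -x$ to simplify. Making the term in curly brackets equal to zero is a simple matter of solving a quadratic in $\Phi^{-1}(\beta)$ whose solutions are
\[
\begin{aligned}
\beta_{\pm} &= \Phi\!\left( \tfrac{1}{2}\, \Phi^{-1}(\beta_0) \pm \sqrt{ \tfrac{1}{4} \left[ \Phi^{-1}(\beta_0) \right]^2 + 1 } \right), \\
&= \Phi\!\left( -\tfrac{1}{2}\, z(\nu)\, \sqrt{1+\eta^2} \pm \sqrt{ \left[ \tfrac{1}{2}\, z(\nu)\, \sqrt{1+\eta^2} \right]^2 + 1 } \right).
\end{aligned}
\]
Clearly, $\beta_{+}$ is the maximum, while $\beta_{-}$ is the minimum.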
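The zero and the two extrema of the treatment effect curve can be computed directly from these closed forms. The sketch below is a hypothetical illustration: it drops the positive constant $\varepsilon^2 E[\sigma^2]$ from $\tau(\beta)$ (which does not move zeros or extrema) and uses arbitrary values $z(\nu) = 0.5$ and $\eta = 1$.

```python
import math
from statistics import NormalDist

STD = NormalDist()

def tau_shape(beta, z, eta):
    """tau(beta) up to the positive constant eps^2 * E[sigma^2] (dropped here)."""
    q = STD.inv_cdf(beta)
    scale = 1.0 / math.sqrt(1.0 + eta**2)
    return scale * STD.pdf(q) * (z + scale * q)

def beta_zero(z, eta):
    """beta_0 = Phi(-z * sqrt(1 + eta^2)), the unique zero of tau."""
    return STD.cdf(-z * math.sqrt(1.0 + eta**2))

def beta_extrema(z, eta):
    """(beta_minus, beta_plus) from the quadratic in Phi^{-1}(beta)."""
    half_q0 = -0.5 * z * math.sqrt(1.0 + eta**2)
    root = math.sqrt(half_q0**2 + 1.0)
    return STD.cdf(half_q0 - root), STD.cdf(half_q0 + root)

# Arbitrary "good news" example: z(nu) > 0.
Z, ETA = 0.5, 1.0
b_zero = beta_zero(Z, ETA)
b_minus, b_plus = beta_extrema(Z, ETA)
print(f"beta_minus={b_minus:.4f}, beta_0={b_zero:.4f}, beta_plus={b_plus:.4f}")
print(f"tau shape at beta_minus={tau_shape(b_minus, Z, ETA):.5f}, "
      f"at beta_plus={tau_shape(b_plus, Z, ETA):.5f}")
```

In this example $\beta_- < \beta_0 < \beta_+$ and $\beta_+ > \tfrac{1}{2}$, with a negative treatment effect at $\beta_-$ and a positive one at $\beta_+$.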
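Proposition 6. In the normal model, when the signal is "good news", that is, $z(\nu) > 0$, we know that $\beta_0 \in \left( 0, \Phi(-z(\nu)) \right)$, which means that
• $\beta_{+} \in \left( \tfrac{1}{2},\; \Phi\!\left( -\tfrac{1}{2} z(\nu) + \sqrt{ \left( \tfrac{1}{2} z(\nu) \right)^2 + 1 } \right) \right)$,
• $\beta_{-} \in \left( 0,\; \Phi\!\left( -\tfrac{1}{2} z(\nu) - \sqrt{ \left( \tfrac{1}{2} z(\nu) \right)^2 + 1 } \right) \right)$,
• $\tau(\beta)$ is increasing on $\beta \in \left( \Phi\!\left( -\tfrac{1}{2} z(\nu) - \sqrt{ \left( \tfrac{1}{2} z(\nu) \right)^2 + 1 } \right),\; \tfrac{1}{2} \right)$,
• $\tau(\beta)$ is decreasing on $\beta \in \left( \Phi\!\left( -\tfrac{1}{2} z(\nu) + \sqrt{ \left( \tfrac{1}{2} z(\nu) \right)^2 + 1 } \right),\; 1 \right)$.
Similarly, when the signal is "bad news", that is, $z(\nu) < 0$, we know that $\beta_0 \in \left( \Phi(-z(\nu)), 1 \right)$, which means that
• $\beta_{+} \in \left( \Phi\!\left( -\tfrac{1}{2} z(\nu) + \sqrt{ \left( \tfrac{1}{2} z(\nu) \right)^2 + 1 } \right),\; 1 \right)$,
• $\beta_{-} \in \left( \Phi\!\left( -\tfrac{1}{2} z(\nu) - \sqrt{ \left( \tfrac{1}{2} z(\nu) \right)^2 + 1 } \right),\; \tfrac{1}{2} \right)$,
• $\tau(\beta)$ is decreasing on $\beta \in \left( 0,\; \Phi\!\left( -\tfrac{1}{2} z(\nu) - \sqrt{ \left( \tfrac{1}{2} z(\nu) \right)^2 + 1 } \right) \right)$,
• $\tau(\beta)$ is increasing on $\beta \in \left( \tfrac{1}{2},\; \Phi\!\left( -\tfrac{1}{2} z(\nu) + \sqrt{ \left( \tfrac{1}{2} z(\nu) \right)^2 + 1 } \right) \right)$.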
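Proof. If the signal is "good news", i.e. $z(\nu) > 0$, then the formula $\beta_0 = \Phi\!\left( -\sqrt{1+\eta^2}\, z(\nu) \right)$ makes it clear that $\beta_0 \in \left( 0, \Phi(-z(\nu)) \right)$. Now, the derivative of the extrema locations is
\[
\frac{\partial \beta_{\pm}}{\partial \beta_0}
= \phi\!\left( \tfrac{1}{2}\, \Phi^{-1}(\beta_0) \pm \sqrt{ \left( \tfrac{1}{2}\, \Phi^{-1}(\beta_0) \right)^2 + 1 } \right) \cdot \frac{1}{2} \cdot \left[ 1 \pm \frac{ \tfrac{1}{2}\, \Phi^{-1}(\beta_0) }{ \sqrt{ \left( \tfrac{1}{2}\, \Phi^{-1}(\beta_0) \right)^2 + 1 } } \right] \cdot \frac{\partial \Phi^{-1}}{\partial \beta}(\beta_0),
\]
which is always positive. Hence we can plug in the domain of $\beta_0$ to get the range of $\beta_{\pm}$. Doing so yields the ranges in the theorem. To understand the ranges on which $\tau(\beta)$ is increasing and decreasing, note that the curve is necessarily decreasing on $\beta \in (0, \beta_{-})$, increasing on $\beta \in (\beta_{-}, \beta_{+})$, and decreasing on $\beta \in (\beta_{+}, 1)$. This means that the curve is always decreasing on $\beta \in \left( 0, \min_{\eta} \{\beta_{-}\} \right)$, increasing on $\beta \in \left( \max_{\eta} \{\beta_{-}\}, \min_{\eta} \{\beta_{+}\} \right)$, and decreasing on $\beta \in \left( \max_{\eta} \{\beta_{+}\}, 1 \right)$, which gives the ranges in the theorem. We have only laid out the rationale for the "good news" case in the theorem, but the "bad news" case follows from identical logic.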
Proof of Proposition 4. To derive the ranges for $\beta_{\pm}$, we simply find the $z(\nu)$ that makes the range from Proposition 4 the largest. The ranges of monotonicity follow for the same reasons described in the proof of Proposition 4.
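Proof of Theorem 5. The intervals that define $\Lambda(\beta \mid \eta)$ and $\Upsilon(\beta \mid \eta)$ follow trivially from the shape of the treatment effect curve and the definition of $b(\eta)$. To sign $b'(\eta)$, we start by taking the derivative of the log of its definition, $\log\left[ \tau(b(\eta) \mid \eta) \right] = \log\left[ \tau(\beta \mid \eta) \right]$:$^{32}$
\[
\frac{ \frac{\partial \tau}{\partial \beta}(b(\eta))\, b'(\eta) - \frac{\eta}{1+\eta^2} \left[ \tau(b(\eta) \mid \eta) + \varepsilon^2\, E[\sigma^2]\, \frac{1}{1+\eta^2}\, \phi\!\left( \Phi^{-1}(b(\eta)) \right) \Phi^{-1}(b(\eta)) \right] }{ \tau(b(\eta) \mid \eta) }
= \frac{ -\frac{\eta}{1+\eta^2} \left[ \tau(\beta \mid \eta) + \varepsilon^2\, E[\sigma^2]\, \frac{1}{1+\eta^2}\, \phi\!\left( \Phi^{-1}(\beta) \right) \Phi^{-1}(\beta) \right] }{ \tau(\beta \mid \eta) },
\]
$^{32}$This is allowed even when the argument of the log is negative. To see this, assume $x < 0$, and note that $\log(x) = \log\left( |x|\, e^{-\pi i} \right) = \log(|x|) + \log\left( e^{-\pi i} \right) = \log(|x|) - \pi i$.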
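which simplifies to
\[
\frac{ \frac{\partial \tau}{\partial \beta}(b(\eta))\, b'(\eta) }{ \tau(b(\eta) \mid \eta) }
- \frac{ \frac{\eta}{(1+\eta^2)^{3/2}}\, \Phi^{-1}(b(\eta)) }{ z(\nu) + \frac{1}{\sqrt{1+\eta^2}}\, \Phi^{-1}(b(\eta)) }
= - \frac{ \frac{\eta}{(1+\eta^2)^{3/2}}\, \Phi^{-1}(\beta) }{ z(\nu) + \frac{1}{\sqrt{1+\eta^2}}\, \Phi^{-1}(\beta) },
\]
which in turn simplifies to
\[
\begin{aligned}
b'(\eta)
&= \frac{\eta}{(1+\eta^2)^{3/2}} \cdot \frac{\tau(b(\eta) \mid \eta)}{\tau'(b(\eta))} \cdot
\frac{ \Phi^{-1}(b(\eta)) \left[ z(\nu) + \frac{1}{\sqrt{1+\eta^2}}\, \Phi^{-1}(\beta) \right] - \Phi^{-1}(\beta) \left[ z(\nu) + \frac{1}{\sqrt{1+\eta^2}}\, \Phi^{-1}(b(\eta)) \right] }{ \left[ z(\nu) + \frac{1}{\sqrt{1+\eta^2}}\, \Phi^{-1}(b(\eta)) \right] \left[ z(\nu) + \frac{1}{\sqrt{1+\eta^2}}\, \Phi^{-1}(\beta) \right] }, \\
&= \frac{ \frac{\eta}{(1+\eta^2)^{3/2}}\, \tau(b(\eta) \mid \eta)\, z(\nu) }{ \left[ z(\nu) + \frac{1}{\sqrt{1+\eta^2}}\, \Phi^{-1}(b(\eta)) \right] \left[ z(\nu) + \frac{1}{\sqrt{1+\eta^2}}\, \Phi^{-1}(\beta) \right] } \cdot \frac{1}{\tau'(b(\eta))} \cdot \left\{ \Phi^{-1}(b(\eta)) - \Phi^{-1}(\beta) \right\}, \\
&= \left\{ \frac{\eta}{(1+\eta^2)^{5/2}}\, \varepsilon^4 \left( E[\sigma^2] \right)^2 \phi\!\left( \Phi^{-1}(\beta) \right) \phi\!\left( \Phi^{-1}(b(\eta)) \right) \right\} \cdot \frac{z(\nu)}{\tau(\beta \mid \eta)} \cdot \frac{1}{\tau_1(b(\eta) \mid \eta)} \cdot \left\{ \Phi^{-1}(b(\eta)) - \Phi^{-1}(\beta) \right\}.
\end{aligned}
\]
Now, the first term is positive, regardless. For both parts of the proposition, $z(\nu)$ and $\tau(\beta \mid \eta)$ have the same sign, and $\tau_1(\beta \mid \eta)$ is negative (which means that $\tau_1(b(\eta) \mid \eta)$ is positive). Hence, for both cases considered in the proposition,
\[
\begin{aligned}
\operatorname{sgn}\{ b'(\eta) \} &= \operatorname{sgn}\left\{ \Phi^{-1}(b(\eta)) - \Phi^{-1}(\beta) \right\}, \\
&= \operatorname{sgn}\{ b(\eta) - \beta \}.
\end{aligned}
\]
The assertions about the sign of $b'(\eta)$ in the proposition follow directly from this and the general shape of the treatment effect curve.
Now, to establish that setting $\eta = 0$ gives us an inner bound, we just have to show that $\beta$ fulfilling our assumptions doesn't also tell us that $\eta$ has a lower bound greater than zero. For the first part, note that being on the positive and decreasing part of $\tau(\beta)$ puts $\beta \geq \beta_{+}$, and hence $\beta \geq \tfrac{1}{2}$ by Proposition 4. If $\beta \geq \tfrac{1}{2}$, then $\Phi^{-1}(\beta) \geq 0$, which means that $\tau(\beta) \geq 0$ for any $\eta > 0$, since the sign of $\tau$ is the same as the sign of $z(\nu) + \frac{1}{\sqrt{1+\eta^2}}\, \Phi^{-1}(\beta)$. For the second part, being negative and decreasing puts $\beta \leq \beta_{-}$, and hence $\beta \leq \tfrac{1}{2}$ by Proposition 4. Similar logic shows that $\tau(\beta) \leq 0$ for any $\eta > 0$.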
D Proofs from Section 6
D.1 Attrition (Section 6.2)
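Proof of Theorem 7. Integrating the product mentioned in the main text over $\mu$ and $\sigma$, we get
\[
\begin{aligned}
&\int T(\mu)\, t_N(\mu)\, \sigma^2 (\nu - \mu)\, \varepsilon^2\, m(\mu)\, s(\sigma^2)\, d\mu\, d\sigma^2 \\
&\quad = E[\sigma^2] \left\{ \int t_N(\mu)\, T(\mu)\, m(\mu)\, d\mu \right\} \left\{ \frac{\int t_N(\mu)\, T(\mu)\, m(\mu)\, (\nu - \mu)\, d\mu}{\int t_N(\mu)\, T(\mu)\, m(\mu)\, d\mu} \right\} \varepsilon^2 \\
&\quad = E[\sigma^2] \left\{ \int T(\mu)\, m(\mu)\, d\mu \right\} \left\{ \frac{\int t_N(\mu)\, T(\mu)\, m(\mu)\, d\mu}{\int T(\mu)\, m(\mu)\, d\mu} \right\} \left( t - E\left[ \mu \mid \theta \leq \mu = \theta_N \right] \right) \varepsilon^2 \\
&\quad = E[\sigma^2]\, \beta\, E\left[ t_N(\mu) \mid \mu \geq \theta \right] \left( t - E\left[ \mu \mid \theta \leq \mu = \theta_N \right] \right) \varepsilon^2.
\end{aligned}
\]
Dividing through by the baseline matriculation rate gives the effect on attrition, which is what we sought to prove.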
Now we compute this e�ect in the normal model.
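Proof of Proposition 5. Let $\mu \sim \Phi(\mu - E[\mu])$ and $\theta, \theta_N \sim \Phi\!\left( \frac{\theta - E[\theta \mid \lambda_t]}{\eta} \right)$, independently, where $E[\theta \mid \lambda_t] \equiv (1+\eta^2)\, \lambda_t - \eta^2\, E[\mu]$. In the proof of Theorem 4, we already showed that $\beta(\lambda_t) = \Phi\!\left[ \sqrt{1+\eta^2}\, (E[\mu] - \lambda_t) \right]$, and hence $\lambda_t(\beta) = E[\mu] - \frac{1}{\sqrt{1+\eta^2}}\, \Phi^{-1}(\beta)$. So, $E[\theta \mid \beta] \equiv E[\mu] - \sqrt{1+\eta^2}\, \Phi^{-1}(\beta)$, and hence $\theta, \theta_N \sim \Phi\!\left( \frac{\theta - \left( E[\mu] - \sqrt{1+\eta^2}\, \Phi^{-1}(\beta) \right)}{\eta} \right)$. Now,
\[
E\left[ \mu \mid \theta \leq \mu = \theta_N \right]
= \frac{ \int \mu\, \phi(\mu - E[\mu])\, \Phi\!\left( \frac{\mu - \left( E[\mu] - \sqrt{1+\eta^2}\, \Phi^{-1}(\beta) \right)}{\eta} \right) \frac{1}{\eta}\, \phi\!\left( \frac{\mu - \left( E[\mu] - \sqrt{1+\eta^2}\, \Phi^{-1}(\beta) \right)}{\eta} \right) d\mu }{ \int \phi(\mu - E[\mu])\, \Phi\!\left( \frac{\mu - \left( E[\mu] - \sqrt{1+\eta^2}\, \Phi^{-1}(\beta) \right)}{\eta} \right) \frac{1}{\eta}\, \phi\!\left( \frac{\mu - \left( E[\mu] - \sqrt{1+\eta^2}\, \Phi^{-1}(\beta) \right)}{\eta} \right) d\mu }.
\]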
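To start this computation, we recall our identity from the proof of Theorem 4,
\[
\phi\!\left( \frac{\mu - a}{b} \right) \phi\!\left( \frac{\mu - c}{d} \right)
= \phi\!\left( \frac{a - c}{\sqrt{b^2 + d^2}} \right) \phi\!\left( \frac{\mu - \frac{a\, d^2 + c\, b^2}{b^2 + d^2}}{\frac{b\, d}{\sqrt{b^2 + d^2}}} \right),
\]
which allows us to write
\[
\begin{aligned}
\phi(\mu - E[\mu])\, \phi\!\left( \frac{\mu - \left( E[\mu] - \sqrt{1+\eta^2}\, \Phi^{-1}(\beta) \right)}{\eta} \right)
&= \phi\!\left( \frac{E[\mu] - \left( E[\mu] - \sqrt{1+\eta^2}\, \Phi^{-1}(\beta) \right)}{\sqrt{1+\eta^2}} \right) \phi\!\left( \frac{\mu - \frac{E[\mu]\, \eta^2 + \left( E[\mu] - \sqrt{1+\eta^2}\, \Phi^{-1}(\beta) \right)}{1+\eta^2}}{\frac{\eta}{\sqrt{1+\eta^2}}} \right), \\
&= \phi\!\left( \Phi^{-1}(\beta) \right) \phi\!\left( \frac{\mu\, \sqrt{1+\eta^2} - E[\mu]\, \sqrt{1+\eta^2} + \Phi^{-1}(\beta)}{\eta} \right).
\end{aligned}
\]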
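Now, we can write
\[
\begin{aligned}
&\int \Phi\!\left( \frac{\mu - \left( E[\mu] - \sqrt{1+\eta^2}\, \Phi^{-1}(\beta) \right)}{\eta} \right) \phi(\mu - E[\mu])\, \frac{1}{\eta}\, \phi\!\left( \frac{\mu - \left( E[\mu] - \sqrt{1+\eta^2}\, \Phi^{-1}(\beta) \right)}{\eta} \right) d\mu \\
&\quad = \phi\!\left( \Phi^{-1}(\beta) \right) \int \Phi\!\left( \frac{\mu - \left( E[\mu] - \sqrt{1+\eta^2}\, \Phi^{-1}(\beta) \right)}{\eta} \right) \frac{1}{\eta}\, \phi\!\left( \frac{\mu\, \sqrt{1+\eta^2} - E[\mu]\, \sqrt{1+\eta^2} + \Phi^{-1}(\beta)}{\eta} \right) d\mu \\
&\quad = \phi\!\left( \Phi^{-1}(\beta) \right) \int \Phi\!\left( \frac{ \left( \frac{\eta}{\sqrt{1+\eta^2}}\, x + E[\mu] - \frac{\Phi^{-1}(\beta)}{\sqrt{1+\eta^2}} \right) - \left( E[\mu] - \sqrt{1+\eta^2}\, \Phi^{-1}(\beta) \right) }{\eta} \right) \phi(x)\, \frac{1}{\eta} \cdot \frac{\eta}{\sqrt{1+\eta^2}}\, dx \\
&\quad = \phi\!\left( \Phi^{-1}(\beta) \right) \int \Phi\!\left( \frac{x}{\sqrt{1+\eta^2}} + \frac{\eta}{\sqrt{1+\eta^2}}\, \Phi^{-1}(\beta) \right) \phi(x)\, \frac{1}{\sqrt{1+\eta^2}}\, dx \\
&\quad = \frac{1}{\sqrt{1+\eta^2}}\, \phi\!\left( \Phi^{-1}(\beta) \right) \Phi\!\left( \frac{\eta\, \Phi^{-1}(\beta)}{\sqrt{(1+\eta^2)(2+\eta^2)}} \right),
\end{aligned}
\]
where we use the Gaussian integral identity $\int \Phi(a + b\, x)\, \phi(x)\, dx = \Phi\!\left( \frac{a}{\sqrt{1+b^2}} \right)$, and
\[
\begin{aligned}
&\int \mu\, \phi(\mu - E[\mu])\, \Phi\!\left( \frac{\mu - \left( E[\mu] - \sqrt{1+\eta^2}\, \Phi^{-1}(\beta) \right)}{\eta} \right) \frac{1}{\eta}\, \phi\!\left( \frac{\mu - \left( E[\mu] - \sqrt{1+\eta^2}\, \Phi^{-1}(\beta) \right)}{\eta} \right) d\mu \\
&\quad = \phi\!\left( \Phi^{-1}(\beta) \right) \int \left( \frac{\eta}{\sqrt{1+\eta^2}}\, x + E[\mu] - \frac{\Phi^{-1}(\beta)}{\sqrt{1+\eta^2}} \right) \Phi\!\left( \frac{x}{\sqrt{1+\eta^2}} + \frac{\eta}{\sqrt{1+\eta^2}}\, \Phi^{-1}(\beta) \right) \phi(x)\, \frac{1}{\sqrt{1+\eta^2}}\, dx \\
&\quad = \phi\!\left( \Phi^{-1}(\beta) \right) \int \left( \frac{\eta}{\sqrt{1+\eta^2}}\, x \right) \Phi\!\left( \frac{x}{\sqrt{1+\eta^2}} + \frac{\eta}{\sqrt{1+\eta^2}}\, \Phi^{-1}(\beta) \right) \phi(x)\, \frac{1}{\sqrt{1+\eta^2}}\, dx \\
&\qquad + \phi\!\left( \Phi^{-1}(\beta) \right) \left( E[\mu] - \frac{\Phi^{-1}(\beta)}{\sqrt{1+\eta^2}} \right) \int \Phi\!\left( \frac{x}{\sqrt{1+\eta^2}} + \frac{\eta}{\sqrt{1+\eta^2}}\, \Phi^{-1}(\beta) \right) \phi(x)\, \frac{1}{\sqrt{1+\eta^2}}\, dx \\
&\quad = \frac{\eta}{1+\eta^2}\, \phi\!\left( \Phi^{-1}(\beta) \right) \int x\, \Phi\!\left( \frac{x}{\sqrt{1+\eta^2}} + \frac{\eta}{\sqrt{1+\eta^2}}\, \Phi^{-1}(\beta) \right) \phi(x)\, dx \\
&\qquad + \left( E[\mu] - \frac{\Phi^{-1}(\beta)}{\sqrt{1+\eta^2}} \right) \frac{1}{\sqrt{1+\eta^2}}\, \phi\!\left( \Phi^{-1}(\beta) \right) \Phi\!\left( \frac{\eta\, \Phi^{-1}(\beta)}{\sqrt{(1+\eta^2)(2+\eta^2)}} \right) \\
&\quad = \frac{1}{\sqrt{1+\eta^2}}\, \phi\!\left( \Phi^{-1}(\beta) \right) \left\{ \frac{\eta}{\sqrt{(1+\eta^2)(2+\eta^2)}}\, \phi\!\left( \frac{\eta\, \Phi^{-1}(\beta)}{\sqrt{2+\eta^2}} \right) + \left( E[\mu] - \frac{\Phi^{-1}(\beta)}{\sqrt{1+\eta^2}} \right) \Phi\!\left( \frac{\eta\, \Phi^{-1}(\beta)}{\sqrt{(1+\eta^2)(2+\eta^2)}} \right) \right\},
\end{aligned}
\]
where we use the Gaussian integral identity $\int x\, \Phi(a + b\, x)\, \phi(x)\, dx = \frac{b}{\sqrt{1+b^2}}\, \phi\!\left( \frac{a}{\sqrt{1+b^2}} \right)$.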
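The two Gaussian integral identities quoted in this derivation can be checked by direct numerical integration. The sketch below uses a plain trapezoid rule on a truncated range; the helper names, the grid, and the test values of $a$ and $b$ are arbitrary choices for illustration.

```python
import math
from statistics import NormalDist

STD = NormalDist()

def integrate(func, lo=-10.0, hi=10.0, steps=20000):
    """Composite trapezoid rule; the normal density is negligible beyond |x| = 10."""
    h = (hi - lo) / steps
    total = 0.5 * (func(lo) + func(hi))
    for i in range(1, steps):
        total += func(lo + i * h)
    return total * h

def cdf_identity_lhs(a, b):
    """Numerical value of int Phi(a + b x) phi(x) dx."""
    return integrate(lambda x: STD.cdf(a + b * x) * STD.pdf(x))

def cdf_identity_rhs(a, b):
    """Closed form Phi(a / sqrt(1 + b^2))."""
    return STD.cdf(a / math.sqrt(1.0 + b * b))

def first_moment_identity_lhs(a, b):
    """Numerical value of int x Phi(a + b x) phi(x) dx."""
    return integrate(lambda x: x * STD.cdf(a + b * x) * STD.pdf(x))

def first_moment_identity_rhs(a, b):
    """Closed form b / sqrt(1 + b^2) * phi(a / sqrt(1 + b^2))."""
    s = math.sqrt(1.0 + b * b)
    return b / s * STD.pdf(a / s)

A, B = 0.3, 0.7  # arbitrary test values
print(cdf_identity_lhs(A, B), cdf_identity_rhs(A, B))
print(first_moment_identity_lhs(A, B), first_moment_identity_rhs(A, B))
```

Both identities have short probabilistic readings: the first is $\Pr(Z \leq a + bX)$ for independent standard normals $Z$ and $X$, and the second follows from Stein's lemma, $E[X\, g(X)] = E[g'(X)]$.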
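So,
\[
\begin{aligned}
E\left[ \mu \mid \theta \leq \mu = \theta_N \right]
&= \frac{ \frac{1}{\sqrt{1+\eta^2}}\, \phi\!\left( \Phi^{-1}(\beta) \right) \left\{ \frac{\eta}{\sqrt{(1+\eta^2)(2+\eta^2)}}\, \phi\!\left( \frac{\eta\, \Phi^{-1}(\beta)}{\sqrt{2+\eta^2}} \right) + \left( E[\mu] - \frac{\Phi^{-1}(\beta)}{\sqrt{1+\eta^2}} \right) \Phi\!\left( \frac{\eta\, \Phi^{-1}(\beta)}{\sqrt{(1+\eta^2)(2+\eta^2)}} \right) \right\} }{ \frac{1}{\sqrt{1+\eta^2}}\, \phi\!\left( \Phi^{-1}(\beta) \right) \Phi\!\left( \frac{\eta\, \Phi^{-1}(\beta)}{\sqrt{(1+\eta^2)(2+\eta^2)}} \right) }, \\
&= E[\mu] - \frac{1}{\sqrt{1+\eta^2}} \left\{ \Phi^{-1}(\beta) - \frac{\eta}{\sqrt{2+\eta^2}} \cdot \frac{ \phi\!\left( \frac{\eta}{\sqrt{2+\eta^2}}\, \Phi^{-1}(\beta) \right) }{ \Phi\!\left( \frac{\eta}{\sqrt{(1+\eta^2)(2+\eta^2)}}\, \Phi^{-1}(\beta) \right) } \right\}.
\end{aligned}
\]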
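Now, we can compute the treatment effect on attrition rate
\[
\begin{aligned}
\tau_A &= E[\sigma^2]\, E\left[ t_N(\mu) \mid \mu \geq \theta \right] \left( t - E\left[ \mu \mid \theta \leq \mu = \theta_N \right] \right) \varepsilon^2, \\
&= \varepsilon^2\, E[\sigma^2]\, \frac{1}{\sqrt{1+\eta^2}}\, \phi\!\left( \Phi^{-1}(\beta) \right) \Phi\!\left( \frac{\eta\, \Phi^{-1}(\beta)}{\sqrt{(1+\eta^2)(2+\eta^2)}} \right) \\
&\qquad \cdot \left\{ z(\nu) + \frac{1}{\sqrt{1+\eta^2}}\, \Phi^{-1}(\beta) - \frac{\eta}{\sqrt{(1+\eta^2)(2+\eta^2)}} \cdot \frac{ \phi\!\left( \frac{\eta}{\sqrt{2+\eta^2}}\, \Phi^{-1}(\beta) \right) }{ \Phi\!\left( \frac{\eta}{\sqrt{(1+\eta^2)(2+\eta^2)}}\, \Phi^{-1}(\beta) \right) } \right\}.
\end{aligned}
\]
This is what we sought to prove.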
Comparing the treatment effect on attrition rate to the treatment effect on matriculation rate, we see that while the latter is positive when $z(\nu) \geq E[\mu \mid \theta = \mu] = -\frac{1}{\sqrt{1+\eta^2}}\, \Phi^{-1}(\beta)$, the former is positive when
\[
z(\nu) \geq E\left[ \mu \mid \theta \leq \mu = \theta_N \right] = E[\mu \mid \theta = \mu] + \frac{\eta}{\sqrt{(1+\eta^2)(2+\eta^2)}} \cdot \frac{ \phi\!\left( \frac{\eta}{\sqrt{2+\eta^2}}\, \Phi^{-1}(\beta) \right) }{ \Phi\!\left( \frac{\eta}{\sqrt{(1+\eta^2)(2+\eta^2)}}\, \Phi^{-1}(\beta) \right) }.
\]
Since this extra term is positive, it represents the width of the range of $z(\nu)$ such that we expect a positive treatment effect on matriculation and a negative treatment effect on attrition rate.
D.2 Nudges that reassure (Section 6.3)
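Proof of Theorem 8. Recall from Sections 2.3 and 2.5 that untreated agents join when $E[x] - \tfrac{1}{2}\, \alpha_1\, E\left[ (x - E[\mu])^2 \right] \geq \theta$, while treated agents join when $E[x \mid \nu] - \tfrac{1}{2}\, \alpha_1\, E\left[ (x - E[\mu])^2 \mid \nu \right] \geq \theta$. Hence, the width of the range of thresholds that are persuaded by the nudge is
\[
\begin{aligned}
&E[x \mid \nu] - E[x] - \tfrac{1}{2}\, \alpha_1 \left\{ E\left[ (x - E[\mu])^2 \mid \nu \right] - E\left[ (x - E[\mu])^2 \right] \right\} \\
&\quad = \sigma^2 \left\{ (\nu - \mu) - \tfrac{1}{2}\, \sigma\, \gamma_1 \right\} \varepsilon^2
- \tfrac{1}{2}\, \alpha_1 \left\{ E\left[ (x - E[x \mid \nu])^2 \mid \nu \right] - E\left[ (x - \mu)^2 \right] + \left( E[x \mid \nu] - E[\mu] \right)^2 - \left( \mu - E[\mu] \right)^2 \right\}.
\end{aligned}
\]
Expanding out the components of the $\alpha_1$ term, we get
\[
\begin{aligned}
E\left[ (x - E[x \mid \nu])^2 \mid \nu \right] - E\left[ (x - \mu)^2 \right]
&= -\sigma^2 \left[ \sigma^2 \left( 1 + \tfrac{1}{2}\, \gamma_2 \right) - \sigma\, \gamma_1 (\nu - \mu) \right] \varepsilon^2, \\
&= -\sigma^4 \varepsilon^2 - \tfrac{1}{2}\, \sigma^4 \gamma_2\, \varepsilon^2 + \sigma^3 \gamma_1 (\nu - \mu)\, \varepsilon^2, \\
&= -\sigma^4 \varepsilon^2 + O\left( \gamma_2\, \varepsilon^2 \right) + O\left( \gamma_1 (\nu - \mu)\, \varepsilon^2 \right),
\end{aligned}
\]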
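and
\[
\begin{aligned}
\left( E[x \mid \nu] - E[\mu] \right)^2 - \left( \mu - E[\mu] \right)^2
&= \left( (\mu - E[\mu]) + \sigma^2 \left\{ (\nu - \mu) - \tfrac{1}{2}\, \sigma\, \gamma_1 \right\} \varepsilon^2 \right)^2 - \left( \mu - E[\mu] \right)^2 \\
&= 2\, (\mu - E[\mu])\, \sigma^2 \left\{ (\nu - \mu) - \tfrac{1}{2}\, \sigma\, \gamma_1 \right\} \varepsilon^2 + \sigma^4 \left\{ (\nu - \mu) - \tfrac{1}{2}\, \sigma\, \gamma_1 \right\}^2 \varepsilon^4 \\
&= O\left( (\nu - \mu)\, \varepsilon^2 \right) + O\left( \gamma_1\, \varepsilon^2 \right) + O\left( (\nu - \mu)\, \varepsilon^4 \right) + O\left( \gamma_1\, \varepsilon^4 \right).
\end{aligned}
\]
Hence, the width of the range of thresholds is
\[
\begin{aligned}
&E[x \mid \nu] - E[x] - \tfrac{1}{2}\, \alpha_1 \left\{ E\left[ (x - E[\mu])^2 \mid \nu \right] - E\left[ (x - E[\mu])^2 \right] \right\} \\
&\quad = \sigma^2 \left\{ (\nu - \mu) - \tfrac{1}{2}\, \sigma\, \gamma_1 \right\} \varepsilon^2 + \tfrac{1}{2}\, \sigma^4 \alpha_1\, \varepsilon^2 + o\left( \alpha_1\, \varepsilon^2 \right).
\end{aligned}
\]
We will multiply this by the density of thresholds, evaluated at $E[x] - \tfrac{1}{2}\, \alpha_1\, E\left[ (x - E[\mu])^2 \right] = \mu + O(\alpha_1)$, which is equal to $t(\mu) + O(\alpha_1)$. To leading order, then, the product of the density and the range is $t(\mu) \left[ \sigma^2 \left\{ (\nu - \mu) - \tfrac{1}{2}\, \sigma\, \gamma_1 \right\} + \tfrac{1}{2}\, \sigma^4 \alpha_1 \right] \varepsilon^2$. Taking the expectation gives
\[
\tau = E[t(\mu)] \left\{ E[\sigma^2] \left( \nu - E[\mu \mid \theta = \mu] \right) - \tfrac{1}{2}\, E[\sigma^3]\, E[\gamma_1] + \tfrac{1}{2}\, E[\sigma^4]\, E[\alpha_1] \right\} \varepsilon^2,
\]
which is what we sought to show.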
D.3 Continuous choices (Section 6.4)
Proof of Theorem 9. Let $a^*(\varepsilon^2)$ be the maximizer of the optimization from the main text. The first-order conditions for this optimization require
\[
a^*(\varepsilon^2) = a_0 - \frac{ u_1(a_0, E[\mu]) + u_{12}(a_0, E[\mu]) \left[ \mu - E[\mu] + \sigma^2 (\nu - \mu)\, \varepsilon^2 \right] }{ u_{11}(a_0, E[\mu]) }.
\]
To leading order then, $a^*$ will move by
\[
\varepsilon^2 \cdot \frac{\partial a^*}{\partial \varepsilon^2}(0) = -\frac{u_{12}(a_0, E[\mu])}{u_{11}(a_0, E[\mu])}\, \sigma^2 (\nu - \mu)\, \varepsilon^2.
\]
Taking the expectation across the population yields our result.
D.4 Strategic signal revelation (Section 6.5)
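Proof of Theorem 10. First, we need to calculate how agents will respond to being told that $\nu$ is in partition element $\Pi$. Define $\omega(\Pi \mid \nu_0) \equiv \int_{\Pi} n(\nu \mid \nu_0)\, d\nu$. Then,
\[
\begin{aligned}
f(x \mid \Pi) &= \frac{f(x)\, \omega(\Pi \mid x)}{\int f(x)\, \omega(\Pi \mid x)\, dx} \\
&= \frac{ f(x) \left\{ \omega(\Pi \mid \nu_0) + \omega_2(\Pi \mid \nu_0)\, (x - \nu_0)\, \varepsilon + \tfrac{1}{2}\, \omega_{22}(\Pi \mid \nu_0)\, (x - \nu_0)^2\, \varepsilon^2 \right\} }{ \int f(x) \left\{ \omega(\Pi \mid \nu_0) + \omega_2(\Pi \mid \nu_0)\, (x - \nu_0)\, \varepsilon + \tfrac{1}{2}\, \omega_{22}(\Pi \mid \nu_0)\, (x - \nu_0)^2\, \varepsilon^2 \right\} dx } \\
&= \frac{ 1 + \varepsilon\, \frac{\omega_2(\Pi \mid \nu_0)}{\omega(\Pi \mid \nu_0)}\, (x - \nu_0) + \tfrac{1}{2}\, \varepsilon^2\, \frac{\omega_{22}(\Pi \mid \nu_0)}{\omega(\Pi \mid \nu_0)}\, (x - \nu_0)^2 }{ 1 + \varepsilon\, \frac{\omega_2(\Pi \mid \nu_0)}{\omega(\Pi \mid \nu_0)}\, (\mu - \nu_0) + \tfrac{1}{2}\, \varepsilon^2\, \frac{\omega_{22}(\Pi \mid \nu_0)}{\omega(\Pi \mid \nu_0)} \left[ \sigma^2 + (\mu - \nu_0)^2 \right] }\, f(x).
\end{aligned}
\]
The natural place to center the expansion is at the maximum likelihood estimate of $x$ given $\Pi$, that is, $\nu_{\Pi} = \arg\max_{\nu_0} \int_{\Pi} n(\nu \mid \nu_0)\, d\nu$. The first-order conditions for this optimization dictate
\[
\begin{aligned}
\int_{\Pi} n_2(\nu \mid \nu_{\Pi})\, d\nu &= 0, \\
\int_{\Pi} \left[ n_2(\nu \mid \nu) + \varepsilon\, n_{22}(\nu \mid \nu)\, (\nu_{\Pi} - \nu) \right] d\nu &= 0, \\
\varepsilon \left( \nu_{\Pi} \int_{\Pi} n_{22}(\nu \mid \nu)\, d\nu - \int_{\Pi} \nu\, n_{22}(\nu \mid \nu)\, d\nu \right) &= 0, \\
\nu_{\Pi} &= \frac{\int_{\Pi} \nu\, n_{22}(\nu \mid \nu)\, d\nu}{\int_{\Pi} n_{22}(\nu \mid \nu)\, d\nu}.
\end{aligned}
\]
Setting $\nu_0 = \nu_{\Pi}$, we get
\[
\begin{aligned}
f(x \mid \Pi) &= \left[ 1 + \tfrac{1}{2}\, \varepsilon^2\, \frac{\omega_{22}(\Pi \mid \nu_{\Pi})}{\omega(\Pi \mid \nu_{\Pi})}\, (x - \nu_{\Pi})^2 \right] \left[ 1 - \varepsilon^2\, \tfrac{1}{2}\, \frac{\omega_{22}(\Pi \mid \nu_{\Pi})}{\omega(\Pi \mid \nu_{\Pi})} \left[ \sigma^2 + (\mu - \nu_{\Pi})^2 \right] \right] f(x) \\
&= \left\{ 1 - \tfrac{1}{2}\, \varepsilon^2\, \frac{\omega_{22}(\Pi \mid \nu_{\Pi})}{\omega(\Pi \mid \nu_{\Pi})} \left( \left[ \sigma^2 + (\mu - \nu_{\Pi})^2 \right] - (x - \nu_{\Pi})^2 \right) \right\} f(x).
\end{aligned}
\]
At this point, the parallel with Lemma 1 is clear: agents will treat $\nu \in \Pi$ equivalently to $\nu = \nu_{\Pi}$. This means that
\[
\tau(\Pi) = -\varepsilon^2\, E[t(\mu)]\, E[\sigma^2]\, \frac{\omega_{22}(\Pi \mid \nu_{\Pi})}{\omega(\Pi \mid \nu_{\Pi})} \left( \nu_{\Pi} - E[\mu \mid \theta = \mu] \right).
\]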
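For full revelation, this simplifies to
\[
\tau(\nu) = -\varepsilon^2\, E[t(\mu)]\, E[\sigma^2]\, \frac{n_{22}(\nu \mid \nu)}{n(\nu \mid \nu)} \left( \nu - E[\mu \mid \theta = \mu] \right),
\]
which matches what we derived in Theorem 2. Now, to prove the theorem, we need to compare $\tau(\Pi)$ to the expectation of $\tau(\nu)$ conditional on $\nu \in \Pi$. To do so, we will need to compute the probability of a given $\nu$ from whatever prior the principal might have:
\[
\begin{aligned}
\Pr(\nu \mid \nu \in \Pi) &= \frac{ \int n(\nu \mid x)\, f(x)\, dx }{ \int_{\Pi} \int n(\nu \mid x)\, f(x)\, dx\, d\nu } \\
&= \frac{ \int \left[ n(\nu \mid \nu) + \tfrac{1}{2}\, \varepsilon^2\, n_{22}(\nu \mid \nu)\, (x - \nu)^2 \right] f(x)\, dx }{ \int \left[ \omega(\Pi \mid \nu_{\Pi}) + \tfrac{1}{2}\, \varepsilon^2\, \omega_{22}(\Pi \mid \nu_{\Pi})\, (x - \nu_{\Pi})^2 \right] f(x)\, dx } \\
&= \frac{ n(\nu \mid \nu) + \tfrac{1}{2}\, \varepsilon^2\, n_{22}(\nu \mid \nu) \left[ \sigma^2 + (\mu - \nu)^2 \right] }{ \omega(\Pi \mid \nu_{\Pi}) + \tfrac{1}{2}\, \varepsilon^2\, \omega_{22}(\Pi \mid \nu_{\Pi}) \left[ \sigma^2 + (\mu - \nu_{\Pi})^2 \right] } \\
&= \frac{n(\nu \mid \nu)}{\omega(\Pi \mid \nu_{\Pi})} + O(\varepsilon^2).
\end{aligned}
\]
Since this will be multiplied by $\varepsilon^2$ anyway, we can ignore the $O(\varepsilon^2)$ term, leaving us with
\[
\begin{aligned}
E[\tau(\nu) \mid \nu \in \Pi]
&= -\varepsilon^2\, E[t(\mu)]\, E[\sigma^2] \int_{\Pi} \frac{n_{22}(\nu \mid \nu)}{n(\nu \mid \nu)} \left( \nu - E[\mu \mid \theta = \mu] \right) \Pr(\nu \mid \nu \in \Pi)\, d\nu, \\
&= -\varepsilon^2\, E[t(\mu)]\, E[\sigma^2] \int_{\Pi} \frac{n_{22}(\nu \mid \nu)}{n(\nu \mid \nu)} \left( \nu - E[\mu \mid \theta = \mu] \right) \frac{n(\nu \mid \nu)}{\omega(\Pi \mid \nu_{\Pi})}\, d\nu, \\
&= -\varepsilon^2\, E[t(\mu)]\, E[\sigma^2] \left( \frac{\int_{\Pi} \nu\, n_{22}(\nu \mid \nu)\, d\nu}{\omega(\Pi \mid \nu_{\Pi})} - E[\mu \mid \theta = \mu]\, \frac{\int_{\Pi} n_{22}(\nu \mid \nu)\, d\nu}{\omega(\Pi \mid \nu_{\Pi})} \right), \\
&= -\varepsilon^2\, E[t(\mu)]\, E[\sigma^2]\, \frac{\int_{\Pi} n_{22}(\nu \mid \nu)\, d\nu}{\omega(\Pi \mid \nu_{\Pi})} \left( \frac{\int_{\Pi} \nu\, n_{22}(\nu \mid \nu)\, d\nu}{\int_{\Pi} n_{22}(\nu \mid \nu)\, d\nu} - E[\mu \mid \theta = \mu] \right), \\
&= -\varepsilon^2\, E[t(\mu)]\, E[\sigma^2]\, \frac{\int_{\Pi} \left[ n_{22}(\nu \mid \nu_{\Pi}) + O(\varepsilon) \right] d\nu}{\omega(\Pi \mid \nu_{\Pi})} \left( \nu_{\Pi} - E[\mu \mid \theta = \mu] \right), \\
&= -\varepsilon^2\, E[t(\mu)]\, E[\sigma^2]\, \frac{\omega_{22}(\Pi \mid \nu_{\Pi})}{\omega(\Pi \mid \nu_{\Pi})} \left( \nu_{\Pi} - E[\mu \mid \theta = \mu] \right), \\
&= \tau(\Pi).
\end{aligned}
\]
So, for each element $\Pi$ of the partition defining a revelation strategy, to leading order, in expectation, reporting the actual $\nu$ instead of just $\nu \in \Pi$ is just as good. Hence full revelation is just as good as any revelation strategy.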