On Optimum Strategies for Minimizing the Exponential...
Transcript of On Optimum Strategies for Minimizing the Exponential...
On Optimum Strategies for Minimizing the Exponential Moments
of a Loss Function
Neri Merhav
Department of Electrical EngineeringTechnion – Israel Institute of Technology
Technion City, Haifa 32000, [email protected]
Abstract
We consider a general problem of finding a strategy that minimizes the exponential momentof a given cost function, with an emphasis on its relation to the more common criterion ofminimization the expectation of the first moment of the same cost function. In particular, thebasic observation that we make and use is about simple sufficient conditions for a strategy to beoptimum in the exponential moment sense. This observation may be useful in various situations,and application examples are given. We also examine the asymptotic regime and investigateuniversal asymptotically optimum strategies in light of the aforementioned sufficient conditions,as well as phenomena of irregularities, or phase transitions, in the behavior of the asymptoticperformance, which can be viewed and understood from a statistical–mechanical perspective.Finally, we propose a new route for deriving lower bounds on exponential moments of certaincost functions (like the square error in estimation problems) on the basis of well known lowerbounds on their expectations.
Index Terms: loss function, exponential moment, large deviations, phase transitions, universalschemes.
1 Introduction
Many problems in information theory, communications, statistical signal processing, and related
disciplines can be formalized as being about the quest for a strategy s that minimizes (or maximizes)
the expectation of a certain cost function, ℓ(X, s), where X is a random variable (or a random
vector). Just a few examples of this generic paradigm are the following: (i) Lossless and lossy data
compression, where X symbolizes the data to be compressed, s is the data compression scheme,
and ℓ(X, s) is the length of the compressed binary representation, or the distortion (in the lossy
1
case) or a linear combination of both (see, e.g., [8, Chapters 5 and 10]). (ii) Gambling and portfolio
theory [8, Chapters 6 and 16], where cost function is logarithm of the wealth relative. (iii) Lossy
joint source–channel coding, where X collectively symbolizes the randomness of source and the
channel, s is the encoding–decoding scheme and ℓ(X, s) is the distortion in the reconstruction (see,
e.g., [53],[54]). (iv) Bayesian estimation of a random variable based on measurements, where X
designates jointly the desired random variable and the measurements, s is the estimation function
and ℓ(X, s) is the error function, for the example, the squared error. Non–Bayesian estimation
problems can be considered similarly (see, e.g., [48]). (v) Prediction, sequential decision problems
(see, for example, [35]) and stochastic control problems [6], such as the linear quadratic Gaussian
(LQG) problem, as well as general Markov decision processes, are also formalized in terms of
selecting strategies in order to minimize the expectation of a certain loss function.
While the criterion of minimizing the expected value of ℓ(X, s) has been predominantly the
most common one, the exponential moments of ℓ(X, s), namely, E exp{αℓ(X, s)} (α > 0), have
received much less attention than they probably deserve in this context, at least in information
theory and signal processing. In the realm of the theory of optimization and stochastic control, on
the other hand, the problem of minimizing exponential moments has received much more attention,
and it is well–known as the risk–sensitive or risk–averse cost function (see, e.g., [11], [16], [20], [24],
[50], [51] and many references therein), where one of the main motivations for using the exponential
function of ℓ(X, s) is to impose a penalty, or a risk, that is extremely sensitive to large values of
ℓ(X, s), hence the qualifier “risk–sensitive” in the name of this criterion. Another motivation is
associated with robustness properties of the resulting risk–sensitive optimum controllers [4], [18].
There are, in fact, a few additional motivations for examining strategies that minimize exponential
moments, which are also relevant to many problem areas of information theory, communications
and statistical signal processing. First and foremost, the exponential moment, E exp{αℓ(X, s)}, as
a function of α, is obviously the moment–generating function of ℓ(X, s), and as such, it provides
the full information about the entire distribution of this random variable, not just its first order
moment. Thus, in particular, if we are fortunate enough to find a strategy that uniformly minimizes
E exp{αℓ(X, s)} for all α ≥ 0 (and there are examples that this may be the case), then this is much
stronger than just minimizing the first moment. Secondly, exponential moments are intimately
related to large–deviations rate functions, and so, the minimization of exponential moments may
2
give us an edge on minimizing probabilities of (undesired) large deviations events of the form
Pr{ℓ(X, s) ≥ L0} (for some threshold L0), or more precisely, on maximizing the exponential rate of
decay of these probabilities. There are several works along this line, especially in contexts related to
buffer overflow in data compression [19], [25], [26], [30], [37], [47], [52], exponential moments related
to guessing [1], [2], [3], [29], [33], [38], large deviations properties of parameter/signal estimators,
[7], [27], [40], [45], [46], [55], and more.
It is natural to ask, in view of the foregoing discussion, how we can harness the existing body of
knowledge concerning optimization of strategies for minimizing the first moment of ℓ(X, s), which is
quite mature in many applications, in our quest for optimum strategies that minimize exponential
moments. Our basic observation, in this paper, provides a simple relationship between the two
criteria. In particular, in Section 2, we furnish sufficient conditions that the optimum strategy in
the exponential moment sense can be found in terms of the optimum strategy in the first moment
sense, for a possibly different probability distribution, which we characterize.
The main message of this expository paper is that the combination of these sufficient conditions
sets the stage for a useful tool that can be used to solve concrete problems of minimizing exponential
moments, and it may give a fresh look and a new insight into these problems. In some applications,
these sufficient conditions for optimality in the exponential moment sense, yield an equation in s,
whose solution is the desired optimum strategy. In other applications, however, this may not be
quite the case directly, yet the set of optimality conditions may still be useful: More often than
not, in a given instance of the problem under discussion, one may have a natural intuitive guess
concerning the optimum strategy, and then the optimality conditions can be used to prove that
this is the case.
At the heart of this paper stands a section of application examples (Section 3). One example for
the use of the proposed tool, that will be demonstrated in detail (and in more generality) later on,
is the following: Given n independent and identically distributed (i.i.d.) Gaussian measurements,
X1, . . . ,Xn, with mean θ, the sample mean, s(X1, . . . ,Xn) = 1n
∑ni=1 Xi, is the optimum unbiased
estimator of θ, not merely in the mean squared error sense (as is well known), but also in the sense
of minimizing all exponential moments of the squared error, i.e., E exp{α[s(X1, . . . ,Xn)− θ]2} for
all α ≥ 0 for which this expectation is finite. Another example belongs to the realm of universal
3
lossless data compression and the famous minimum description length principle (MDL), due to
Rissanen [44], who characterized the minimum achievable “price of universality” as a redundancy
term of about 12 log n (without normalization) for each unknown parameter of the information
source to be compressed. Here we show how this result (both the converse theorem and the direct
theorem) extends from the expected code–length criterion to all exponential moments of the code–
length. Yet another example is the memoryless Gaussian joint source–channel coding problem with
the quadratic distortion measure and without bandwidth expansion: As is well known, minimum
expected distortion can be achieved by optimum linear scalar encoders and decoders. When it
comes to exponential moments of the distortion, the behavior is somewhat more complicated. As
long as one forces a linear scalar encoder, the optimum decoder is also linear. However, once the
constraint of linear encoding is relaxed, both the optimum encoder and the optimum decoder are
no longer linear (see also [9] for a related work). Our above–mentioned basic principle sheds some
insight on this problem too.
We next devote some attention to the asymptotic regime (Section 4). Consider the case where
X is a random vector of dimension n, X = (X1, . . . ,Xn), governed by a product–form probabil-
ity distribution, and ℓ(X, s) grows linearly for a given empirical distribution of X, for example,
when ℓ(X, s) is additive, i.e., ℓ(X, s) =∑n
i=1 l(Xi, s). In this case, the exponential moments of
ℓ(X, s) typically behave (at least asymptotically) like exponential functions of n. If we can then
select a strategy s that somehow “adapts”1 to the empirical distribution of (X1, . . . ,Xn), then
such strategies may be universally optimum (or asymptotically optimum in the sense of achieving
the minimum exponential rate of the exponential moment) in that they depend on neither the
underlying probability distribution, nor on the parameter α. This is demonstrated in several ex-
amples, one of which is the above–mentioned extension of Rissanen’s famous results in universal
data compression [44].
An interesting byproduct of the use of the exponential moment criterion in the asymptotic
regime is the possible existence of phase transitions (Section 5): In turns out that the asymptotic
exponential rate of E exp{αℓ(X1, . . . ,Xn, s)} as a function of n, may not be a smooth function
of α and/or the parameters of the underlying probability distribution even when the model under
discussion seems rather simple and ‘innocent.’ This is best understood from a statistical–mechanical
1The precise meaning of this will be clarified in the sequel.
4
perspective, because in some cases, the calculation of the exponential moment is clearly analogous
to that of the partition function of a certain physical system of interacting particles, which is
known to exhibit phase transitions. It is demonstrated that at least in certain cases, these phase
transitions are not merely an artifact of a badly chosen strategy, but they appear even when the
optimum strategy is used, and hence these phase transitions are inherent in the model.
We end this paper by touching upon yet another aspect of the exponential moment criterion,
which we do not investigate very thoroughly here, but we believe it is interesting and therefore
certainly deserves a further study in the future (Section 6): Even in the ordinary setting, of seeking
strategies that minimize E{ℓ(X, s)}, optimum strategies may not always be known, and then lower
bounds are of considerable importance as a reference performance figure. This is a–fortiori the case
when exponential moments are considered. One way to obtain non–trivial bounds on exponential
moments is via lower bounds on the expectation of ℓ(X, s), using the techniques developed in this
paper. We demonstrate this idea in the context of a lower bound on the expected exponentiated
squared error of an unbiased parameter estimator, on the basis of the Cramer–Rao bound (CRB),
but it should be understood that, more generally, the same idea can be applied on the basis of
other well–known bounds of the mean-square error (Bayesian and non–Bayesian) in parameter
estimation, and in signal estimation, as well as in other problem areas.
2 The Basic Observation
Let X be a random variable taking on values in a certain alphabet X , and drawn according to a
given probability distribution P . The alphabet X may either be finite, countable, or a continuous
set. In the latter case, P (as well as other probability functions on X ) denotes a density with
respect to a certain measure µ on X , say the counting measure in the discrete case, or the Lebesgue
measure in the continuous case. Let the variable s designate a strategy, or an action, chosen from
some space S of allowed strategies. The term “strategy”, in our context, means a mathematical
object that, depending on the application, may be either a scalar variable, a vector, an infinite
sequence, a function (of X), or a function of another random variable/vector that is statistically
dependent on X. Associated with each x ∈ X and s ∈ S, is a loss ℓ(x, s). The function ℓ(x, s)
is called the loss function, or the cost function. The operator E{·} will be understood as the
5
expectation operator with respect to (w.r.t.) the underlying distribution P , and whenever we
refer to the expectation w.r.t. another probability distribution, say, Q, we use the notation EQ{·}.
Nonetheless, occasionally, when there is more than one probability distribution playing a role at
the same time and we wish to emphasize that the expectation is taken w.r.t. P , then to avoid
confusion, we may denote this expectation by EP {·}.
For a given α > 0, consider the problem of minimizing E exp{αℓ(X, s)} across s ∈ S. The
following observation relates the optimum s for this problem to the optimum s for the problem of
minimizing EQ{ℓ(X, s)} w.r.t. another probability distribution Q.
Observation 1 Assume that there exists a strategy s ∈ S for which
Z(s)∆= EP exp{αℓ(X, s)} < ∞. (1)
A strategy s ∈ S minimizes EP exp{αℓ(X, s)} if there exists a probability distribution Q on X that
satisfies the following two conditions at the same time:
1. The strategy s minimizes EQ{ℓ(X, s)} over S.
2. The probability distribution Q is given by
Q(x) =P (x)eαℓ(x,s)
Z(s). (2)
An equivalent formulation of Observation 1 is the following: denoting by sQ a strategy that
minimizes EQ{ℓ(X, s)} over S, then the sQ minimizes EP exp{αℓ(X, s)} over S if
Q(x) ∝ P (x)eαℓ(x,sQ), (3)
where by A(x) ∝ B(x), we mean that A(x)/B(x) is a constant, independent of x.
Proof. Let s ∈ S be arbitrary and let (s∗, Q∗) satisfy conditions 1 and 2 of Observation 1. Consider
the following chain of inequalities:
EP exp{αℓ(X, s)} = EQ∗ exp
{
αℓ(X, s) + lnP (X)
Q∗(X)
}
≥ exp {αEQ∗ℓ(X, s) − D(Q∗‖P )}
≥ exp {αEQ∗ℓ(X, s∗) − D(Q∗‖P )}
= Z(s∗) = EP exp{αℓ(X, s∗)}, (4)
6
where the first equality results from a change of measure (multiplying and dividing eαℓ(X,s) by
Q∗(X)), the second line is by Jensen’s inequality and the convexity of the exponential function
(with D(Q‖P )∆= EQ ln[Q(X)/P (X)] being the relative entropy between Q and P ), the third
line is by condition 1, and the next equality results from condition 2: On substituting Q∗(x) =
P (x)eαℓ(x,s∗)/Z(s∗) into D(Q∗‖P ), one readily obtains D(Q∗‖P ) = αEQ∗ℓ(X, s∗) − ln Z(s∗). This
completes the proof of Observation 1. �
Discussion: Several comments are in order at this point.
Partially related results have appeared in the literature of optimization and control (cf. [20,
Theorem 4.9]). However, in [20], a much more complicated and more involved paradigm (of con-
trolling finite–state Markov processes) has been considered, and the results therein do not seem to
be completely equivalent to Observation 1. Moreover, since the setting here is much simpler, then
so is the proof, which is not only short, but also almost free of regularity conditions (as opposed
to [20] and [11]). The only regularity condition needed here is that Z(s) < ∞ for some s ∈ S.
Obviously, without this condition, the problem under consideration is meaningless and empty in
the first place.
Note that for a given s, Jensen’s inequality
EQ exp
{
αℓ(X, s) + lnP (X)
Q(X)
}
≥ exp {αEQℓ(X, s) − D(Q‖P )} (5)
of (4) (but with Q∗ being replaced by a generic measure Q), becomes an equality for Q(x) =
P (x)eαℓ(x,s)/Z(s), since for this choice of Q, the random variable that appears in the exponent,
αℓ(X, s) + ln P (X)Q(X) , becomes degenerate (constant with probability one). Since the original ex-
pression is independent of Q, such an equality in Jensen’s inequality means that the expression
αEQℓ(X, s) − D(Q‖P ) is maximized by this choice of Q(x) = P (x)eα(x,s)/Z(s), a fact which can
also be seen from a direct maximization of this expression using standard methods. This leads
directly to the well–known identity (see, e.g., [11, Proposition 2.3]):
EP exp{αℓ(X, s)} = exp{maxQ
[αEQℓ(X, s) − D(Q‖P )]}, (6)
which is also intimately related to the well–known Laplace principle [17] in large deviations theory,
or more generally, to Varadhan’s integral lemma [12, Section 4.3].
7
In view of eq. (6), another look at the problem of minimizing EP exp{αℓ(X, s)} reveals that it
is equivalent2 to the minimax problem
mins
maxQ
F (s,Q) (7)
where
F (s,Q)∆= αEQℓ(X, s) − D(Q‖P ). (8)
Now, suppose that the set S and the loss function ℓ(x, s) are such that:
mins∈S
maxQ
F (s,Q) = maxQ
mins∈S
F (s,Q). (9)
This equality between the minimax and the maximin means that there is a saddle point (s∗, Q∗),
where s∗ is a solution of the minimax problem on the left–hand side and Q∗ is a solution to
the maximin problem on the right–hand side. As mentioned above, the maximizing Q in the
inner maximization on the left–hand side is Q∗(x) = P (x)eαℓ(x,s∗)/Z(s∗), which is condition 2 of
Observation 1. By the same token, the inner minimization over s on the right–hand side obviously
minimizes EQ∗ℓ(X, s), which is condition 1. This means then that the two conditions of Observation
1 are actually equivalent to the conditions for the existence of a saddle point of F (s,Q). This can
be considered as an alternative proof of Observation 1. The original proof above, however, is much
simpler: Not only is it shorter, but furthermore, it requires neither acquaintance with eq. (6) nor
with the theory of minimax/maximin optimization and saddle points.
When does eq. (9) hold? In general, the well–known sufficient conditions for
minu∈U
maxv∈V
f(u, v) = maxv∈V
minu∈U
f(u, v) (10)
are that U and V are convex sets (with u being allowed to take on values freely in U , independently
of v and vice versa), and that f is convex in u and concave in v. In our case, since the function
F (s,Q) is always concave in Q, this sufficient condition would automatically hold whenever ℓ(x, s)
is convex in s (for every fixed x), provided that S is a space in which convex combinations can
be well defined, and that S is a convex set. There are, of course, milder sufficient conditions that
allow commutation of minimization and maximization (see, e.g,, [43, Chapters 36 and 37]).
2See also [11].
8
A similar, but somewhat different, criterion pertaining to exponential moments, which is rea-
sonable to the same extent, is the dual problem of maxs∈S E exp{−αℓ(X, s)} (again, with α > 0),
which is called, in the jargon of stochastic control, a risk–seeking cost criterion, as opposed to the
risk-sensitive criterion discussed so far. If ℓ(x, s) is non-negative for all x and s, this has the ad-
vantage that the exponential moment is finite for all α > 0, as opposed to E exp{αℓ(X, s)} which,
in many cases, is finite only for a limited range of α. For the same considerations as before, here
we have:
maxs
E exp{−αℓ(X, s)} = maxs
exp{maxQ
[−αEQℓ(X, s) − D(Q‖P )]}
= exp{−mins
minQ
[αEQℓ(X, s) + D(Q‖P )]}, (11)
and so the optimality conditions relating s and Q are similar to those of Observation 1 (with α
replaced by −α), except that now we have a double minimization problem rather than a minimax
problem. However, it should be noted that here the conditions of Observation 1 are only necessary
conditions, as for the above equalities to hold, the pair (s,Q) should globally minimize the function
F (s,Q), unlike the earlier case, where only a saddle point was sought.3 On the other hand,
another advantage of this criterion, is that even if one cannot solve explicitly the equation for the
optimum s, then the double minimization naturally suggests an iterative algorithm: starting from an
initial guess s0 ∈ S, one computes Q0(x) ∝ P (x) exp{−αℓ(x, s0)} (which minimizes [αEQℓ(X, s) +
D(Q‖P )] over Q), then one finds s1 = arg mins∈S EQ0{ℓ(X, s)}, and so on. It is obvious that
E exp{−αℓ(X, si)}, i = 0, 1, 2, . . ., increases (and hence improves) from iteration to iteration. This
is different from the minimax situation we encountered earlier, where successive improvements are
not guaranteed.
3 Applications
Observation 1 tells us that if we are fortunate enough to find a strategy s ∈ S and a probability
distribution Q, which are ‘matched’ to one another (in the sense defined by the above conditions),
then we have solved the problem of minimizing the exponential moment. Sometimes it is fairly easy
to find such a pair (s,Q) by solving an equation. In other cases, there might be a natural guess
3In other words, it is not enough now that s and Q are in ‘equilibrium’ in the sense that s is a minimizer for agiven Q and vice versa.
9
for the optimum s, which can be proved optimum by checking the conditions. In yet some other
cases, Observation 1 suggests an iterative algorithm for solving the problem. In this section, we
will see a few application examples of all these types. Some of these examples could have been also
solved directly (and/or the resulting conclusions may even already be known from the literature),
without using Observation 1, but for others, this does not seem to be a trivial task. In any case,
Observation 1 may suggest a fresh look and some new insight on the problem.
3.1 Lossless Data Compression
We begin with a very simple example. Let X be a random variable taking on values in a finite
alphabet X , let s be a probability distribution on X , i.e., a vector {s(x), x ∈ X} with∑
x∈X s(x) = 1
and s(x) ≥ 0 for all x ∈ X , and let ℓ(x, s)∆= − ln s(x). This example is clearly motivated by lossless
data compression, as − ln s(x) is the length function (in nats) pertaining to a uniquely decodable
code that is induced by a distribution s, ignoring integer length constraints. In this problem, one
readily observes that the optimum s for minimizing EQ{− ln s(X)} is sQ = Q. Thus, by eq. (3),
we seek a distribution Q such that
Q(x) ∝ P (x) exp{−α ln Q(x)} =P (x)
[Q(x)]α(12)
which means [Q(x)]1+α ∝ P (x), or equivalently, Q(x) ∝ [P (x)]1/(1+α). More precisely,
sQ(x) = Q(x) =[P (x)]1/(1+α)
∑
x′∈X [P (x′)]1/(1+α). (13)
Note that here ℓ(x, s) is convex in s and so, the minimax condition holds. While this result is
well known and it could have been obtained even without using Observation 1, our purpose in this
example was to show how Observation 1 gives the desired solution even more easily than with the
direct method, by solving a very simple equation.
3.2 Quantization
Consider the problem of quantizing a real–valued random variable X, drawn by P , into M repro-
duction levels, x0, x1, . . . , xM−1, and let the distortion metric d(x, x) be quadratic, i.e., d(x, x) =
(x − x)2. The ordinary problem of optimum quantizer design is about the choice of a function
s : X → {x0, x1, . . . , xM−1}, that minimizes EP [X − s(X)]2, i.e., in this case, ℓ(x, s) = [x − s(x)]2.
10
As is well known [21], [23], [28], in general, this problem lacks a closed–form solution, and the
customary approach is to apply an iterative algorithm for quantizer design, which alternates between
two sets of necessary conditions for optimality: the nearest–neighbor condition, according to which
s(x) should be the reproduction level that is closest to x (i,e,, the one that minimizes (x − xi)2
over i), and the centroid condition, which means that xi should be the conditional expectation of
X given that X falls in the interval of values of x that are to be mapped to the i–th quantization
level.
Consider now the criterion of maximizing the negative exponential moment EP e−α[X−s(X)]2 .
Here the centroid condition is no longer relevant. However, in light of the discussion in Section 2,
one can use the fact that this problem is equivalent to the double minimization of
G(s,Q)∆= αEQ[X − s(X)]2 + D(Q‖P ) (14)
over s and Q. This suggests an iterative algorithm that consists of two nested loops: The outer loop
alternates between minimizing s for a given Q, on the one hand, and minimizing Q for a given s, on
the other hand. As explained in Section 2, these two minimizations are, in principle, nothing but
the conditions of Observation 1, just with α being replaced by −α. The inner loop implements the
former ingredient of minimizing EQ[X − s(X)]2 over s for a given Q, which is again implementable
by the standard iterative procedure that was described in the previous paragraph. Of course, it is
not guaranteed that such an iterative algorithm would converge to the global minimum of G(s,Q),
just like in the case of ordinary iterative quantizer design.
As a simple example for a combination of s and Q that are matched (in the above sense),
consider the case where P is the uniform distribution over the interval [−A,+A]. The optimum
MMSE quantizer in this case is uniform as well: The interval [−A,A] is partitioned evenly into M
sub-intervals, each of size 2A/M and the reproduction level xi, pertaining to each sub-interval, is
its midpoint, What happens when the exponential moment is considered? Let us ‘guess’ that the
same quantizer s remains optimum. Then,
Q(x) ∝ 1{|x| ≤ A} · exp{−α(x − s(x))2} = 1{|x| ≤ A} · exp{−α mini
(x − xi)2}, (15)
which means that Q has “Gaussian peaks” at all reproduction points {xi}. It is plausible that
the same uniform quantizer s continues to be optimum (or at least nearly so) for Q, and hence
11
these s and Q match each other. Moreover, at least for large α, this quantizer nearly attains the
rate–distortion function of Q at distortion level D = 1/(2α). To see why this is true, observe that
for large α, the factor exp{−αmini(x − xi)2} = maxi exp{−α(x − xi)
2} is well approximated by∑
i exp{−α(x − xi)2}, which after normalization of Q, becomes essentially a mixture of M evenly
weighted Gaussians, where the i–th mixture component is centered at xi, i = 0, 1, . . . ,M − 1. This
mixture approximation of Q can then be viewed as a convolution between the uniform discrete
distribution on {xi} and the Gaussian density N (0, 1/(2α)). Thus, for D ≤ 1/(2α), the rate–
distortion function of Q agrees with the Shannon lower bound (see, e.g., [22, Chapter 4]), which
is
RL(D) = h(Q) −1
2log(2πeD)
≈ log M +1
2log(πe
α
)
−1
2log(2πeD)
= log M +1
2log
(
1
2αD
)
, (16)
which, for D = 1/(2α), indeed gives a coding rate of log M , just like the uniform quantizer. Here,
h(Q) stands for the differential entropy pertaining to Q.
Finally, consider the case where the source vector X = (X1, . . . ,Xn) is to be quantized by a
sequential causal quantizer with memory, i.e., Xt is quantized into one of M quantization levels,
which are now allowed to depend on past outputs of the quantizer X1, . . . , Xt−1. Here, the relevant
exponential moment criterion would be EP exp{−α∑n
t=1[Xt − s(Xt|X1, . . . , Xt−1)]2}. As is shown
in [36], however, whenever the source P is memoryless, the allowed dependence of the current
quantization on the past does not improve the exponential moment performance, i.e., the optimum
quantizer of this type makes use of the current symbol Xt only. This means that the causal vector
quantization problem actually degenerates back to the scalar quantization problem considered in
the previous paragraphs. This continues to be true even if variable–rate coding is allowed, except
that then time–sharing between at most two quantizers must also be allowed. These results of [36],
for the exponential moment criterion, are analogous to those of Neuhoff and Gilbert [42] for the
ordinary criterion of expected code length for a given distortion (or expected distortion at fixed
rate).
12
3.3 Non–Bayesian Parameter Estimation
Let X = (X1,X2, . . . ,Xn)T be a Gaussian random vector with mean vector θu, where θ ∈ IR and
u = (u1, . . . , un)T is a given deterministic vector. Let the non–singular n × n covariance matrix of
X be given by Λ. The vector X can then be thought of as a set of measurements (contaminated
by non–white Gaussian noise) of a signal, represented by a known vector u, but with an unknown
gain θ to be estimated. This is a classical problem in non–Bayesian estimation theory. It is well
known that for this kind of a model, among all unbiased estimators of θ, the one that minimizes
the mean square error (or equivalently, the estimation error variance) is the maximum likelihood
(ML) estimator, which in this case, is easily found to be given by
s(x) =uT Λ−1x
uT Λ−1u. (17)
Does this estimator also minimize E exp{α[s(X) − θ]2} among all unbiased estimators and for all
values of α in the allowed range?
The class S of all unbiased estimators is clearly a convex set and (s − θ)2 is convex in s. Let
us ‘guess’ that this estimator indeed minimizes also E exp{α[s(X) − θ]2} and then check whether
it satisfies the conditions of Observation 1. Denoting v = Λ−1u/(uT Λ−1u), the corresponding
probability measure Q, which will be denoted here by Qθ, is given by
Qθ(x) ∝ exp
{
−1
2(x − θu)T Λ−1(x − θu) + α
(
vT x − θ)2}
= exp
{
−1
2(x − θu)T Λ−1(x − θu) + α
[
vT (x − θu)]2}
= exp
{
−1
2(x − θu)T Λ−1(x − θu) + α(x − θu)T vvT (x − θu)
}
= exp
{
−1
2(x − θu)T (Λ−1 − 2αvvT )(x − θu)
}
, (18)
where α is chosen small enough such that the matrix (Λ−1 − 2αvvT ) is still positive definite. Now,
the ML estimator of θ under Qθ is given by
sQ(x) =uT (Λ−1 − 2αvvT )x
uT (Λ−1 − 2αvvT )u
=uT Λ−1x − 2αuT vvT x
uT Λ−1u − 2αuT vvT u
=[1 − 2α/(uT Λ−1u)]uT Λ−1x
[1 − 2α/(uT Λ−1u)]uT Λ−1u
= s(x), (19)
13
where in the third line we have used the fact that uT v = 1. In other words, we are back to
the original estimator we started from, which means that our ‘guess’ was successful. Indeed, the
MSE of s(x) under Qθ, which is vT (Λ−1 − 2αvvT )−1v, can easily be shown to be identical to the
Cramer–Rao lower bound under Qθ, which is given by 1/[uT (Λ−1 − 2αvvT )u]. We can therefore
summarize our conclusion in the following proposition:
Proposition 1 Let X be a Gaussian random vector with a mean vector θu and a non–singular
covariance matrix Λ. Let a be the supremum of all values of α such that the matrix (Λ−1 − 2αvvT )
is positive definite, where v = Λ−1u/(uT Λ−1u). Then, among all unbiased estimators of θ, the
estimator s(x) = vT x uniformly minimizes the exponential moment E exp{α[s(X) − θ]2} for all
α ∈ (0, a). This minimum of the exponential moment is given by
E exp{
α(
vT X − θ)2}
=1
√
det(I − 2αvvT Λ). (20)
It is easy to see that for α → 0, the limit of 1α ln E exp{α(vT X − θ)2} recovers the mean–
square error tr{vvT Λ} = vT Λv, as expected. Here we have the full best achievable characteristic
function and hence also the best achievable large deviations performance in the sense of minimizing
of probabilities of the form Pr{|s(X) − θ| ≥ R} for all R > 0.
Related results by Kester and Kallenberg [27] (and some subsequent works) concern the asymp-
totic large deviations optimality of the ML estimator, among all consistent estimators, for the case
of exponential families of i.i.d. measurements. Here, on the other hand, the model is not i.i.d.,
the result holds (for exponential moments) for all n, and not only asymptotically, and the class of
competing estimators is the class of unbiased estimators.
3.4 Gaussian–Quadratic Joint Source–Channel Coding
Consider the Gaussian memoryless source
PU (u) = (2πσ2u)−n/2 exp
{
−1
2σ2u
n∑
i=1
u2i
}
(21)
and the Gaussian memoryless channel y = x + z, where the noise is distributed according to
PZ(z) = (2πσ2z )−n/2 exp
{
−1
2σ2z
n∑
i=1
z2i
}
. (22)
14
In the ordinary joint source–channel coding problem, one seeks an encoder and decoder that would
minimize D = 1n
∑ni=1 E{(Ui − Vi)
2}, where V = (V1, . . . , Vn) is the reconstruction at the decoder.
It is very well known that the best achievable distortion, in this case, is given by
D =σ2
u
1 + Γ/σ2z
, (23)
where Γ is the maximum power allowed at the transmitter, and it may be achieved by a transmitter
that simply amplifies the source by a gain factor of√
Γ/σ2u and a receiver that implements linear
MMSE estimation of Ui given Yi, on a symbol–by–symbol basis.
What happens if we replace the criterion of expected distortion by the criterion of the expo-
nential moment on the distortion, E exp{α∑
i(Ui − Vi)2}? It is natural to wonder whether simple
linear transmitters and receivers, of the kind defined in the previous paragraph, are still optimum.
The random object X, in this example, is the pair of vectors (U ,Z), where U is the source vector
and Z is the channel noise vector, which under P = PU×PZ , are independent Gaussian i.i.d. random
vectors with zero mean and variances σ2u and σ2
z , respectively, as said. Our strategy s consists of
the choice of an encoding function x = f(u) and a decoding function v = g(y). The class S is then
the set of all pairs of functions {f, g}, where f satisfies the power constraint EP {‖f(U )‖2} ≤ nΓ.
Condition 2 of Observation 1 tells us that the modified probability distribution of u and z should
be of the form
Q(u,z) ∝ PU (u)PZ(z) exp
{
α
n∑
i=1
[ui − gi(f(u) + z)]2
}
(24)
where gi is restriction of g to the i–th component of v.
Clearly, if we continue to restrict the encoder f to be linear, with a gain of√
Γ/σ2u, which simply
exploits the allowed power Γ, and the only remaining room for optimization concerns the decoder
g, then we are basically dealing with a problem of pure Bayesian estimation in the Gaussian regime,
and then the optimum choice of the decoder (estimator) continues to be the same linear decoder
as before (see [41]).4 However, once we extend the scope and allow f to be a non–linear encoder,
then the optimum choice of f and g would no longer remain linear like in the expected distortion
case. It is not difficult to see that the conditions of Observation 1 are no longer met for any
linear functions f and g. The key reason is that while Q(u,z) of eq. (24) continues to be Gaussian
4This can also be obtained by applying Observation 1 to the problem of Bayesian estimation in the Gaussianregime under the exponential moment criterion. See Section 3.2 in the original version of this paper [32].
15
(though now Ui and Zi are correlated) when f and g are linear, the power constraint, EP {‖X‖2} ≤
nΓ, when expressed as an expectation w.r.t. Q, becomes EQ{‖f(U )‖2P (U )/Q(U )} ≤ nΓ, but
“power” function ‖f(u)‖2P (u)/Q(u), with P and Q being Gaussian densities, is no longer the
usual quadratic function of f(u) for which there is a linear encoder and decoder that is optimum.
Another way to see that linear encoders and decoders are suboptimal, is to consider the following
argument: For a given n, the expected exponentiated squared error is minimized by a joint source–
channel coding system, defined over a super-alphabet of n–tuples, with respect to a distortion
measure, defined in terms of a single super–letter, as
d(u,v) = exp
{
α
n∑
i=1
(ui − vi)2
}
. (25)
For such a joint source–channel coding system to be optimal, the induced channel P (v|u) must [5,
p. 31, eq. (2.5.13)] be proportional to
P (v) exp{−βd(u,v)} = P (v) exp
[
−β exp
{
α∑
i
(ui − vi)2)
}]
(26)
for some β > 0, which is the well–known structure of the optimum test channel that attains the
rate–distortion function for the Gaussian source and the above defined distortion measure. Had
the aforementioned linear system been optimum, the optimum output distribution P (v) would
be Gaussian, and then P (v|u) would remain proportional to a double exponential function of∑
i(ui − vi)2. However, the linear system induces instead a Gaussian channel from u to v, which
is very different, and therefore cannot be optimum.
Of course, the minimum of E exp{α∑
i(Ui − Vi)2} can be approached by separate source- and
channel coding, defined on blocks of super–letters formed by n–tuples. The source encoder is an
optimum rate–distortion code for the above defined ‘single–letter’ distortion measure, operating at
a rate close to the channel capacity, and the channel code is constructed accordingly to support
the same rate.
4 Universal Asymptotically Optimum Strategies
The optimum strategy for minimizing EP exp{αℓ(X, s)} depends, in general, on both P and α. It
turns out, however, that this dependence on P and α can sometimes be relaxed if one gives up the
ambition of deriving a strictly optimum strategy, and resorts to asymptotically optimum strategies.
16
Consider the case where, instead of one random variable X, we have a random vector X =
(X1, . . . ,Xn), governed by a product distribution function
P (x) =n∏
i=1
P (xi), (27)
where each component xi of the vector x = (x1, . . . , xn) takes on values in a finite set X . If
ℓ(x, s) grows linearly5 with n for a given empirical distribution of x and a given s ∈ S, then it is
expected that the exponential moment E exp{αℓ(x, s)} would behave, at least asymptotically, as
an exponential function of n. In particular, for a given s, let us assume that the limit
limn→∞
1
nln EP exp{αℓ(X , s)}
exists. Let us denote this limit by E(s, α, P ). An asymptotically optimum strategy is then a strategy
s∗ for which
E(s∗, α, P ) ≤ E(s, α, P ) (28)
for every s ∈ S. An asymptotically optimum strategy s∗ is called universal asymptotically optimum
w.r.t. a class P of probability distributions, if s∗ is independent of α and P , yet it satisfies eq.
(28) for all α in the allowed range, every s ∈ S, and every P ∈ P. In this section, we take P to
be the class of all memoryless sources with a given finite alphabet X ,6 We denote by TQ the type
class pertaining to an empirical distribution Q, namely, the set of vectors x ∈ X n whose empirical
distribution is Q.
Suppose there exists a strategy s∗ and a function λ : P → IR such that following two conditions
hold:
(a) For every type class TQ and every x ∈ TQ, ℓ(x, s∗) ≤ n[λ(Q) + o(1)], where o(1) designates a
(positive) sequence that tends to zero as n → ∞.
(b) For every type class TQ and every s ∈ S,
∣
∣
∣
∣
TQ ∩ {x : ℓ(x, s) ≥ n[λ(Q) − o(1)]}
∣
∣
∣
∣
≥ e−no(1)|TQ|. (29)
5This happens, for example, when ℓ is additive, i.e., ℓ(x, s) =Pn
i=1l(xi, s).
6It should be understood that extensions of the following discussion to more general classes of sources, like theclass of Markov sources, is essentially straightforward in many cases, although there may be elements that mightrequire some more caution.
17
It is then a straightforward exercise to show, using the method of types, that s∗ is a universal
asymptotically optimum strategy w.r.t. P, with
E(s∗, α, P ) = maxQ
[αλ(Q) − D(Q‖P )], (30)
where condition (a) supports the direct part and condition (b) supports the converse part. The
interesting point here then is not quite in the last statement, but in the fact that there are quite a
few application examples where these two conditions hold at the same time.
Before we provide such examples, however, a few words are in order concerning conditions (a)
and (b). Condition (a) means that there is a choice of s∗, that does not depend on x or on its
type class,7 yet the performance of s∗, for every x ∈ TQ, “adapts” to the empirical distribution Q
of x in a way, that according to condition (b), is “essentially optimum” (i.e., cannot be improved
significantly), at least for a considerable (non–exponential) fraction of the members of TQ. It is
instructive to relate conditions (a) and (b) above to conditions 1 and 2 of Observation 1. First,
observe that in order to guarantee asymptotic optimality of s∗, condition 2 of Observation 1 can
be somewhat relaxed: For Jensen’s inequality in (4) to remain exponentially tight, it is no longer
necessary to make the random variable αℓ(X , s) + ln[P (X)/Q(X)] completely degenerate (i.e., a
constant for every realization x, as in condition 2 of Observation 1), but it is enough to keep it
essentially fixed across a considerably large subset of the dominant type class, TQ∗ , i.e., the one
whose empirical distribution Q∗ essentially achieves the maximum of [αλ(Q) − D(Q‖P )]. Taking
Q∗(x) to be the memoryless source induced by the dominant Q∗, this is indeed precisely what
happens under conditions (a) and (b), which imply that
αℓ(x, s∗) + lnP (x)
Q∗(x)≈ nαλ(Q) +
n∑
i=1
lnP (xi)
Q∗(xi)
= nαλ(Q) + n∑
x∈X
Q∗(x) lnP (x)
Q∗(x)
= n[αλ(Q∗) − D(Q∗‖P )], (31)
for (at least) a non–exponential fraction of the members of TQ∗ , namely, a subset of TQ∗ that is large
enough to maintain the exponential order of the (dominant) contribution of TQ∗ to E exp{αℓ(x, s∗)}.
Loosely speaking, the combination of conditions (a) and (b) also means then that s∗ is essentially
7As before, s∗ is chosen without observing the data first.
18
optimum for (this subset of) TQ∗ , which is a reminiscence of condition 1 of Observation 1. More-
over, since s∗ “adapts” to every TQ, in the sense explained above, then this has the flavor of the
maximin problem discussed in Section 2, where s is allowed to be optimized for each and every Q.
Since the minimizing s, in the maximin problem, is independent of P and α, this also explains the
universality property of such a strategy.
Let us now discuss a few examples. The first example is that of fixed–rate rate–distortion
coding. A vector X that emerges from a memoryless source P is to be encoded by a coding scheme
s with respect to a given additive distortion measure, based on a single–letter distortion measure
d : X × X → IR, X being the reconstruction alphabet. Let DQ(R) denote the distortion–rate
function of a memoryless source Q (with a finite alphabet X ) relative to the single–letter distortion
measure d and let ℓ(x, s) designate the distortion between the source vector x and its reproduction,
using a rate–distortion code s. It is not difficult to see that this example meets conditions (a) and
(b) with λ(Q) = DQ(R): Condition (a) is based on the type covering lemma [10, Section 2.4],
according to which each type class TQ can be completely covered by essentially enR ‘spheres’ of
radius nDQ(R) (in the sense of d), centered at the reproduction vectors. Thus s∗ can be chosen
to be a scheme that encodes x in two parts, the first of which is a header that describes the index
of the type class TQ of x (whose description length is proportional to log n) and the second part
encodes the index of the codeword within TQ, using nR nats. Condition (b) is met since there is
no way to cover TQ with exponentially less than enR spheres within distortion less than DQ(R).
By the same token, consider the dual problem of variable–rate coding within a maximum allowed
distortion D. In this case, every source vector x is encoded by ℓ(x, s) nats, and this time, conditions
(a) and (b) apply with the choice λ(Q) = RQ(D), which is the rate–distortion function of Q (the
inverse function of DQ(R)). The considerations are similar to those of the first example.
It is interesting to particularize this example, of variable–rate coding, to the lossless case, D = 0
(thus revisiting Subsection 3.1), where RQ(0) = H(Q), the empirical entropy associated with Q. In
this case, a more refined result can be obtained, which extends a well known result due to Rissanen
[44] in universal data compression: According to [44], given a length function of a lossless data
compression ℓ(x, s) (s being the data compression scheme), and given a parametric class of sources
of n–vectors, {Pnθ }, indexed by a parameter θ ∈ Θ ⊂ IRk, a lower bound on EP n
θℓ(X, s), that
19
applies to most8 values of θ, is given by
EP nθℓ(X, s) ≥ H(Pn
θ ) + (1 − ǫ)k
2log n, (32)
where ǫ > 0 is arbitrarily small (for large n), H(Pnθ ) is the entropy of X associated with Pn
θ , and
EP nθ{·} is the expectation under Pn
θ . On the other hand, the same expression is achievable, by
a number of universal coding schemes, provided that the factor (1 − ǫ) in the above expression is
replaced by (1 + ǫ).
Now, for a given sources Pnθ , let us define Qn
θ as being the source probability function that is
proportional to (Pnθ )1/(1+α). Then, as a lower bound, we have
ln EP nθ
exp{αℓ(X , s)} = maxQ
[αEQℓ(X , s) − nD(Q‖Pnθ )]
≥ αEQnθℓ(X , s) − D(Qn
θ ‖Pnθ )
≥ α
[
H(Qnθ ) + (1 − ǫ)
k
2log n
]
− D(Qnθ ‖P
nθ )
= αH1/(1+α)(Pnθ ) + α(1 − ǫ)
k
2log n, (33)
where the third line follows from Rissanen’s lower bound (for most sources), and where Hu(Pnθ ) is
Renyi’s entropy of order u, namely,
Hu(Pnθ ) =
1
1 − uln
{
∑
x∈Xn
[Pnθ (x)]u
}
. (34)
Consider now the case where {Pnθ , θ ∈ Θ} is the class of all memoryless sources over X , where
the parameter vector θ designates k = |X | − 1 letter probabilities. In this case, since the source
is completely defined by the single–letter probabilities, we can omit the superscript n of Pnθ and
denote the source by Pθ. Define a two–part code s∗, which first encodes the index of the type class
Q and then the index of x within the type class. The corresponding length function is given by
ℓ(x, s∗) = ln |TQ| + k log n ≈ nH(x) +k
2log n, (35)
where H(x) is the empirical entropy pertaining to x, and where the approximate inequality is easily
8“Most values of θ” means all values of θ with the possible exception of a subset of Θ whose Lebesgue measuretends to zero as n tends to infinity.
20
obtained by the Stirling approximation. Then,
ln EPθexp{αℓ(X , s)} = ln EPθ
exp{αnH(X)} + αk
2log n
= ln EPθexp{α min
Q[− ln Q(X)]} + α
k
2log n
≤ minQ
lnEPθexp{−α ln Q(X)} + α
k
2log n
= nαH1/(1+α)(Pθ) + αk
2log n, (36)
and then it essentially achieves the lower bound. Rissanen’s result is now obtained a special case
of this, by dividing both sides of the inequality by α and then taking the limit α → 0.
We next summarize these findings in the form of a theorem, which is an exponential–moment
counterpart of [44, Theorem 1]. The converse part (part (a)) can actually be extended to even more
general classes of sources, which are not even necessarily parametric, using the results of [34], where
the expression (k log n)/(2n) is replaced, more generally, by the capacity of the “channel” from θ to
X,9 as defined by the class of sources {Pθ} when viewed as a set of conditional distributions of X
given θ. For the sake of simplicity, the direct part (part (b)) of this theorem is formalized for the
class of all memoryless sources with a given finite alphabet, parametrized by the letter probabilities,
but it can also be extended to wider classes of sources, like Markov sources of a given order.
Theorem 1 (a) Converse part: Let P = {Pnθ , θ ∈ Θ} be a parametric class of finite–alphabet
memoryless sources, indexed by a parameter θ that takes on values in a compact subset Θ
of IRk. Let the central limit theorem hold, under Qnθ , for the ML estimator of each θ in the
interior of Θ. If ℓ(x, s) is a length function of a code s satisfying the Kraft inequality, then
for every α > 0 and ǫ > 0,
1
nαln EP n
θexp{αℓ(X , s)} ≥ H1/(1+α)(P
nθ ) + (1 − ǫ) ·
k log n
2n, (37)
for all points θ ∈ Θ except for a set Aǫ(n) ⊂ Θ whose Lebesgue measure vanishes as n → ∞
for every fixed ǫ > 0.
(b) Direct part: For the case where P is the class of all memoryless sources with a given alphabet
of size k + 1 and θ designates the vector of the k first letter probabilities, there exists a uni-
9Moreover the converse part can also be harnessed to derive lower bounds on exponential moments of universalprediction w.r.t. similar classes of sources, using the results of [35, eq. (32)].
21
versal lossless data compression code s∗, whose length function ℓ(x, s∗) satisfies the reversed
inequality where the factor (1 − ǫ) is replaced by (1 + ǫ).
Our last example corresponds to a secrecy system. A sequence x is to be communicated to a
legitimate decoder which shares with the transmitter a random key z of nR purely random bits.
The encoder transmits an encrypted message y = φ(x,z), which is an invertible function of x given
z, and hence decipherable by the legitimate decoder. An eavesdropper, which has no access to the
key z, submits a sequence of guesses concerning x until it receives an indication that the last guess
was correct (e.g., a correct guess of a password admits the eavesdropper into a secret system). For
the best possible encryption function φ, what would be the optimum guessing strategy s∗ that the
eavesdropper may apply in order to minimize the α–th moment of the number of guesses G(X , s),
i.e., E{Gα(X, s)}? In this case, ℓ(x, s) = lnG(x, s). As is shown in [33], there exists a guessing
strategy s∗, which for every x ∈ TQ, gives ℓ(x, s∗) ≈ n min{H(Q), R}, a quantity that essentially
cannot be improved upon by any other guessing strategy, for most members of TQ. In other words,
conditions (a) and (b) apply with λ(Q) = min{H(Q), R}.
5 Phase Transitions
Another interesting aspect of the asymptotic behavior of the exponential moment is the possible
appearance of phase transitions, i.e., irregularities in the exponent function E(s, α, P ) even in some
very simple and ‘innocent’ models. By irregularities, we mean a non–smooth behavior, namely,
discontinuities in the derivatives of E(s, α, P ) with respect to α and/or the parameters of the
source P .
One example that exhibits phase transitions is that of the secrecy system, mentioned in the
last paragraph of the previous section. As is shown in [33], the optimum exponent E(s∗, α, p) for
this case consists of two phase transitions as a function of R (namely, three different phases). In
particular,
E(s∗, α, P ) =
αR R < H(P )(α − θR)R + θRH1/(1+θR)(P ) H(P ) ≤ R ≤ H(Pα)
αH1/(1+α)(P ) R > H(Pα)(38)
22
where Pα is the distribution defined by
Pα(x)∆=
P 1/(1+α)(x)∑
x′∈X P 1/(1+α)(x′), (39)
H(Q) is the Shannon entropy associated with a distribution Q, Hu(Q) is the Renyi entropy of
order u as defined before, and θR is the unique solution of the equation R = H(Pθ) for R in the
range H(P ) ≤ R ≤ H(Pα). But this example may not really be extremely surprising due to the
non–smoothness of the function λ(Q) = min{H(Q), R}.
It may be somewhat less expected, however, to witness phase transitions also in some very
simple and ‘innocent’ looking models. One way to understand the phase transitions in these cases
comes from the statistical–mechanical perspective. It turns out that in some cases, the expression
of the exponential moment is analogous to that of a partition function of a certain many–particle
physical system with interactions, which may exhibit phase transitions and these phase transitions
correspond the above–mentioned irregularities.
We now demonstrate a very simple model, which has phase transitions. Consider the case where
X is a binary vector whose components take on values in X = {−1,+1}, and which is governed by
a binary memoryless source Pµ with probabilities Pr{Xi = +1} = 1 − Pr{Xi = −1} = (1 + µ)/2
(µ designating the expected ‘magnetization’ of each binary spin Xi, to make the physical analogy
apparent). The probability of x under Pµ is thus easily shown to be given by
Pµ(x) =
(
1 + µ
2
)(n+P
i xi)/2
·
(
1 − µ
2
)(n−P
i xi)/2
=
(
1 − µ2
4
)n/2
·
(
1 + µ
1 − µ
)
P
i xi/2
. (40)
Consider the estimation of the parameter µ by the ML estimator
µ =1
n
n∑
i=1
xi. (41)
How does the exponential moment of Eµ exp{αn(µ − µ)2} behave? A straightforward derivation
yields
Eµ exp{αn(µ − µ)2} =
(
1 − µ2
4
)n/2
enαµ2∑
x
(
1 + µ
1 − µ
)
P
i xi/2
exp
α
n
(
∑
i
xi
)2
− 2αµ∑
i
xi
=
(
1 − µ2
4
)n/2
enαµ2∑
x
exp
(
1
2ln
1 + µ
1 − µ− 2αµ
)
∑
i
xi +α
n
(
∑
i
xi
)2
.
23
The last summation over {x} is exactly the partition function pertaining to the Curie–Weiss model
of spin arrays in statistical mechanics (see, e.g., [39, Subsection 2.5.2]), where the magnetic field is
given by
B =1
2ln
1 + µ
1 − µ− 2αµ (42)
and the coupling coefficient for every pair of spins is J = 2α. It is well known that this model
exhibits phase transitions pertaining to spontaneous magnetization below a certain critical temper-
ature. In particular, using the method of types [10], this partition function can be asymptotically
evaluated as being of the exponential order of
exp
{
n · max|m|≤1
[
h2
(
1 + m
2
)
+ Bm +J
2· m2
]}
,
where h2(·) is the binary entropy function, which stands for the exponential order of the number
of configurations {x} with a given value of m = 1n
∑
i xi. This expression is clearly dominated
by a value of m (the dominant magnetization m∗) which maximizes the expression in the square
brackets, i.e., it solves the equation
m = tanh(Jm + B), (43)
or in our variables,
m = tanh
(
2αm +1
2ln
1 + µ
1 − µ− 2αµ
)
. (44)
For α < 1/2, there is only one solution and there is no spontaneous magnetization (paramagnetic
phase). For α > 1/2, however, there are three solutions, and only one of them dominates the
partition function, depending on the sign of B, or equivalently, on whether α > α0(µ)∆= 1
4µ ln 1+µ1−µ
or α < α0(µ) and according to the sign of µ. Accordingly, there are five different phases in the plane
spanned by α and µ. The paramagnetic phase α < 1/2, the phases {µ > 0, 1/2 < α < α0(µ)} and
{µ < 0, α > α0(µ)}, where the dominant magnetization m is positive, and the two complementary
phases, {µ < 0, 1/2 < α < α0(µ)} and and {µ > 0, α > α0(µ)}, where the dominant magnetization
is negative. Thus, there is a multi-critical point where the boundaries of all five phases meet, which
the point (µ, α) = (0, 1/2). The phase diagram is depicted in Fig. 1.
Yet another example of phase transitions is that of fixed–rate lossy data compression, discussed
in the previous section. To demonstrate this explicitly, consider the binary symmetric source
(BSS) and the Hamming distortion measure d, and consider a random selection of a rate–R code
24
α
µ+1−1
α = 1/2
α = α0(µ)
paramagnetic phase
m∗ > 0
m∗ > 0 m∗ < 0
m∗ < 0
0
Figure 1: Phase diagram in the plane of (µ, α).
by nenR independent fair coin tosses, one for each of the n components of every one of the enR
codewords. It was shown in [31] that the asymptotic exponent of the negative exponential moment,
E exp{−α∑
i d(Ui, Vi)} (where the expectation is w.r.t. both the source and the random code
selection), is given by the following expression, which obviously exhibits a (second order) phase
transition:
limn→∞
1
nln E exp
{
−α∑
i
d(Ui, Vi)
}
=
{
−αδ(R) α ≤ α(R)−α + ln(1 + eα) + R − ln 2 α > α(R)
(45)
where δ(R) is the distortion–rate function of the BSS w.r.t. the Hamming distortion measure and
α(R) = ln1 − δ(R)
δ(R). (46)
The analysis in [31] is based on the random energy model (REM), [13],[14],[15], a well-known
statistical–mechanical model of spin glasses with strong disorder, which is known to exhibit phase
transitions. Moreover, it is shown in [31] that ensembles of codes that have an hierarchical structure
may have more than one phase transition.
6 Lower Bounds on Exponential Moments
As explained in the Introduction, even in the ordinary setting, of the quest for minimizing E{ℓ(X, s)},
optimum strategies may not always be known, and then useful lower bounds are very important.
25
This is definitely the case when exponential moments are considered, because the exponential mo-
ment criterion is even harder to handle. To obtain non–trivial bounds on exponential moments, we
propose to harness lower bounds on the expectation of ℓ(X, s), possibly using a change of measure,
in the spirit of the proof of Observation 1 and the previous example of a lower bound on universal
lossless data compression. We next demonstrate this idea in the context of a lower bound on the
expected exponentiated squared error of an unbiased estimator, on the basis of the Cramer–Rao
bound (CRB). The basic idea, however, is applicable more generally, e.g., by relying on other well–
known Bayesian/non–Bayesian bounds on the mean-square error (e.g., the Weiss–Weinstein bound
for Bayesian estimation [49]), as well as in bounds on signal estimation (filtering, prediction, etc.),
and in other problem areas as well. Further investigation in the line may be of considerable interest.
Consider a parametric family of probability distributions {Pθ, θ ∈ Θ}, Θ ⊆ IR being the
parameter set, and suppose that we are interested in a lower bound on Eθ exp{α(θ − θ)2}, for any
unbiased estimator of θ, where as before, Eθ denotes expectation w.r.t. Pθ. Consider the following
chain of inequalities, which holds for any θ′ ∈ Θ:
Eθ exp{α(θ − θ)2} = Eθ′ exp
{
α(θ − θ)2 + lnPθ(X)
Pθ′(X)
}
≥ exp{
αEθ′(θ − θ)2 − D(Pθ′‖Pθ)}
= exp{
αEθ′(θ − θ′)2 + α(θ − θ′)2 − D(Pθ′‖Pθ)}
≥ exp{
αCRB(θ′) + α(θ − θ′)2 − D(Pθ′‖Pθ)}
, (47)
where CRB(θ) is the Cramer–Rao bound for unbiased estimators, computed at θ (i.e., CRB(θ) =
1/I(θ), where I(θ) is the Fisher information). Since this lower bound applies for every θ′ ∈ Θ, one
can take its supremum over θ′ ∈ Θ and obtain
ln Eθ exp{α(θ − θ)2} ≥ supθ′∈Θ
[
αCRB(θ′) + α(θ′ − θ)2 − D(Pθ′‖Pθ)]
. (48)
More generally if θ = (θ1, . . . , θk)T is a parameter vector (thus θ ∈ Θ ⊆ IRk) and α ∈ IRk is an
arbitrary deterministic (column) vector, then
ln Eθ exp{αT (θ − θ)(θ − θ)Tα} ≥ supθ′∈Θ
[
αT I−1(θ′)α + [αT (θ′ − θ)]2 − D(Pθ′‖Pθ)]
, (49)
where here I(θ) is the Fisher information matrix and I−1(θ) is its inverse.
26
It would be interesting to further investigate bounds of this type, in parameter estimation in
particular, and in other problem areas in general, and to examine when these bounds may be tight
and useful.
Acknowledgment
Interesting discussions with Rami Atar are acknowledged with thanks.
References
[1] E. Arikan, “An inequality on guessing and its application to sequential decoding,” IEEE
Trans. Inform. Theory, vol. IT–42, no. 1, pp. 99–105, January 1996.
[2] E. Arikan and N. Merhav, “Guessing subject to distortion,” IEEE Trans. Inform. Theory,
vol. 44, no. 3, pp. 1041–1056, May 1998.
[3] E. Arikan and N. Merhav, “Joint source–channel coding and guessing with application to
sequential decoding,” IEEE Trans. Inform. Theory, vol. 44, no. 5, pp. 1756–1769, September
1998.
[4] R. Atar, P. Dupuis, and A. Shwartz, “An escape–time criterion for queueing networks: asymp-
totic risk–sensitive control via differential games,” Mathemtics of Operation Research, vol. 28,
no. 4, pp. 801–835, November 2003.
[5] T. Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression, Prentice
Hall, Englewood Cliffs, New Jersey, U.S.A., 1971.
[6] D. Bertsekas, Dynamic Programming and Optimal Control, Vol. I and II, 3rd ed. Nashua,
NH: Athena Scientific, 2007.
[7] J. P. N. Bishwal, “Large deviations and Berry–Esseen inequalities for estimators in nonlin-
ear nonhomogeneous diffusions,” REVSTAT Statistical Journal, vol. 5, no. 3, pp. 249–267,
November 2007.
27
[8] T. M. Cover and J. A. Thomas, Elements of Information Theory, Second Edition, John Wiley
& Sons, Hoboken, New Jersey, U.S.A., 2006.
[9] I. Csiszar, “On the error exponent of source-channel transmission with a distortion threshold,”
IEEE Trans. Inform. Theory , vol. IT–28, no. 6, pp. 823–828, November 1982.
[10] I. Csiszar and J. Korner, Information Theory: Coding Theorems for Discrete Memoryless
Systems, Academic Press, New York, U.S.A., 1981.
[11] P. Dai Pra, L. Meheghini, and W. J. Runggaldier, “Connections between stochastic control
and dynamic games,” Mathematics of Control, Signals and Systems, vol. 9, pp. 303–326, 1996.
[12] A. Dembo and O. Zeitouni, Large DEviations and Applications, Jones and Bartlett Publishers,
London 1993.
[13] B. Derrida, “Random–energy model: limit of a family of disordered models,” Phys. Rev.
Lett., vol. 45, no. 2, pp. 79–82, July 1980.
[14] B. Derrida, “The random energy model,” Physics Reports (Review Section of Physics Letters),
vol. 67, no. 1, pp. 29–35, 1980.
[15] B. Derrida, “Random–energy model: an exactly solvable model for disordered systems,” Phys.
Rev. B, vol. 24, no. 5, pp. 2613–2626, September 1981.
[16] G. B. Di Masi and L. Stettner, “Risk–sensitive control of discrete–time Markov processes
with infinite horizon,” SIAM Journal on Control and Optimization, vol. 38, no. 1, pp. 61–78,
1999.
[17] P. Dupuis and R. S. Ellis, A Weak Convergence Approach to the Theory of Large Deviations,
John Wiley & Sons, 1997.
[18] P. Dupuis, M. R. James, and I. Petersen, “Robust properties of risk–sensitive control,” Math-
ematics of Control, Signals and Systems, vol. 13, pp. 318–322, 2000.
[19] N. Farvardin and J. W. Modestino, “On overflow and underflow problems in buffer instru-
mented variable-length coding of fixed-rate memoryless sources,” IEEE Transactions on In-
formation Theory , vol. IT–32, no. 6, pp. 839–845, November 1986.
28
[20] W. H. Fleming and D. Hernandez–Hernandez, “Risk–sensitive control of finite state machines
on an infinite horizon I,” SIAM Journal on Control and Optimization, vol. 35, no. 5, pp. 1790–
1810, 1997.
[21] R. M. Gray, “Vector quantization,” IEEE Transactions on Acoustics, Speech, and Signal
Processing , Mag: 4–29, April 1984.
[22] R. M. Gray, Source Coding Theory, Kluwer Academic Publishers, Norwell, MA, U.S.A., 1990.
[23] R. M. Gray and D. L. Neuhoff, “Quantization,” IEEE Trans. Inform. Theory , vol. IT–44,
no. 6, pp. 2325–2383, October 1998.
[24] R. Howard and J. Matheson, “Risk–sensitive Markov decision processes,” Management Sci-
ence, vol. 18, pp. 356–369, 1972.
[25] P. A. Humblet, “Generalization of Huffman coding to minimize the probability of buffer
overflow,” IEEE Transactions on Information Theory , vol. IT–27, no. 2, pp. 230–232, March
1981.
[26] F. Jelinek, “Buffer overflow in variable length coding of fixed rate sources,” IEEE Transactions
on Information Theory , vol. IT–14, no. 3, pp. 490–501, May 1968.
[27] A. D. M. Kester and W. C. M. Kallenberg, “Large deviations of estimators,” Annals of
Mathmatical Statistics, vol. 14, no. 2, pp. 648–664, 1986.
[28] Y. Linde, A. Buzo, and R. M. Gray, “An algorithm for vector quantizer design,” IEEE
Transactions on Communications, vol. COM–28, no. 1, pp. 84–95, January 1980.
[29] J. L. Massey, “Guessing and entropy,” Proc. 1994 IEEE Int. Symp. Inform. Theory, (ISIT
‘94), p. 204, Trondheim, Norway, 1994.
[30] N. Merhav, “Universal coding with minimum probability of code word length overflow,” IEEE
Trans. Inform. Theory, vol. 37, no. 3, pp. 556–563, May 1991.
[31] N. Merhav, “The generalized random energy model and its application to the statistical
physics of ensembles of hierarchical codes,” IEEE Trans. Inform. Theory, vol. 55, no. 3, pp.
1250–1268, March 2009.
29
[32] N. Merhav, “On optimum strategies for minimizing the exponential moments of a given cost
function,” http://arxiv.org/PS cache/arxiv/pdf/1103/1103.2882v1.pdf
[33] N. Merhav and E. Arikan, “The Shannon cipher system with a guessing wiretapper,” IEEE
Trans. Inform. Theory, vol. 45, no. 6, pp. 1860–1866, September 1999.
[34] N. Merhav and M. Feder, “A strong version of the redundancy–capacity theorem of universal
coding,” IEEE Trans. Inform. Theory, vol. 41, no. 3, pp. 714-722, May 1995.
[35] N. Merhav and M. Feder, “Universal prediction,” IEEE Trans. Inform. Theory, vol. 44, no.
6, pp. 2124–2147, October 1998.
[36] N. Merhav and I. Kontoyiannis, “Source coding exponents for zero–delay coding with finite
memory,” IEEE Trans. Inform. Theory, vol. 49, no. 3, pp. 609–625, March 2003.
[37] N. Merhav and D. L. Neuhoff, “Variable-to-fixed length codes provide better large deviations
performance than fixed-to-variable length codes,” IEEE Trans. Inform. Theory, vol. 38, no. 1,
pp. 135–140, January 1992.
[38] N. Merhav, R. M. Roth, and E. Arikan, “Hierarchical guessing with a fidelity criterion,” IEEE
Trans. Inform. Theory, vol. 45, no. 1, pp. 330–337, January 1999.
[39] M. Mezard and A. Montanari, Information, Physics and Computation, Oxford University
Press, 2009.
[40] M. N. Mishra and B. L. S. Prakasa Rao, “ Large deviation probabilities for maximum likeli-
hood estimator and Bayes estimator of a parameter for fractional Ornstein–Uhlenbeck type
process,” Bulletin of Information and Cybernetics, vol. 38, pp. 71–83, 2006.
[41] J. B. Moore, R. J. Elliott, and S. Dey, “Risk–sensitive generalizations of minimum variance
estimation and control,” Journal of Mathematical Systems and Control, vol. 7, no. 1, pp.
1–15, 1997.
[42] D. L. Neuhoff and R. K. Gilbert, “Causal source codes,” IEEE Trans. Inform. Theory ,
vol. IT–28, no. 5, pp. 701–713, September 1982.
[43] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, U.S.A., 1972.
30
[44] J. Rissanen, “Universal coding, information, prediction, and estimation,” IEEE Transactions
on Information Theory , vol. IT–30, no. 4, pp. 629–636, July 1984.
[45] S. Sherman, “Non–mean square error criteria,” IRE Transactions on Information Theory,
vol. IT-4, pp. 125–126, September 1958.
[46] A. Sieders and K. Dzhaparidze, “A large deviations result for parameter estimators and its
application to non–linear regression analysis,” Annals of Statistics, vol. 15, no. 3, pp, 1031–
1049, 1987.
[47] O. Uchida and T. S. Han, “The optimal overflow and underflow probabilities with variable–
length coding for the general source,” preprint 1999.
[48] H. van Trees, Detection, Estimation, and Modulation Theory, Part I, John Wiley & Sons,
New York, 1968.
[49] A. J. Weiss and E. Weinstein, “A lower bound on the mean square error in random parameter
estimation,” IEEE Trans. Inform. Theory , vol. IT–31, no. 5, pp. 680–682, September 1985.
[50] P. Whittle, “Risk–sensitive linear/quadratic/Gaussian control,” Advances in Applied Proba-
bility, vol. 13, pp. 764–777, 1981.
[51] P. Whittle, “A risk–sensitive maximum principle,” Systems and Control Letters, vol. 15, pp.
183–192, 1990.
[52] A. D. Wyner, “On the probability of buffer overflow under an arbitrary bounded input-output
distribution,” SIAM Journal on Applied Mathematics, vol. 27, no. 4, pp. 544–570, December
1974.
[53] A. D. Wyner, “Another look at the coding theorem of information theory- a tutorial,” Proc.
IEEE , vol. 58, no. 6, pp. 894–913, June 1980.
[54] A. D. Wyner, “Fundamental limits in information theory,” Proc. IEEE , vol. 69, no. 2, pp. 239–
251, February 1981.
[55] A. D. Wyner and J. Ziv, “On communication of analog data from a bounded source space,”
Bell Systems Technical Journal, vol. 48, pp. 3139–3172, 1969.
31