Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1....

201
Shannon Theory for Compressed Sensing Yihong Wu A Dissertation Presented to the Faculty of Princeton University in Candidacy for the Degree of Doctor of Philosophy Recommended for Acceptance by the Department of Electrical Engineering Adviser: Professor Sergio Verd´ u September 2011

Transcript of Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1....

Page 1: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Shannon Theory for CompressedSensing

Yihong Wu

A DissertationPresented to the Facultyof Princeton University

in Candidacy for the Degreeof Doctor of Philosophy

Recommended for Acceptanceby the Department of

Electrical EngineeringAdviser: Professor Sergio Verdu

September 2011

Page 2: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

c© Copyright by Yihong Wu, 2011.All rights reserved.

Page 3: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Abstract

Compressed sensing is a signal processing technique to encode analog sources by realnumbers rather than bits, dealing with efficient recovery of a real vector from theinformation provided by linear measurements. By leveraging the prior knowledge ofthe signal structure (e.g., sparsity) and designing efficient non-linear reconstructionalgorithms, effective compression is achieved by taking a much smaller number ofmeasurements than the dimension of the original signal. However, partially due tothe non-discrete nature of the problem, none of the existing models allow a completeunderstanding of the theoretical limits of the compressed sensing paradigm.

As opposed to the conventional worst-case (Hamming) approach, this thesispresents a statistical (Shannon) study of compressed sensing, where signals aremodeled as random processes rather than individual sequences. This frameworkencompasses more general signal models than sparsity. Focusing on optimal decoders,we investigate the fundamental tradeoff between measurement rate and reconstruc-tion fidelity gauged by error probability and noise sensitivity in the absence andpresence of measurement noise, respectively. Optimal phase transition thresholdsare determined as a functional of the input distribution and compared to suboptimalthresholds achieved by various popular reconstruction algorithms. In particular,we show that Gaussian sensing matrices incur no penalty on the phase transitionthreshold of noise sensitivity with respect to optimal nonlinear encoding. Our resultsalso provide a rigorous justification of previous results based on replica heuristics inthe weak-noise regime.

iii

Page 4: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Acknowledgements

I am deeply indebted to my advisor Prof. Sergio Verdu for his constant guidance andsupport at every stage of my Ph.D. studies, without which this dissertation wouldbe impossible. A superb researcher and mentor, he taught me how to formulate aproblem in the cleanest possible way. His brilliant insights, mathematical eleganceand aesthetic taste greatly shaped my research. In addition, I appreciate his endlesspatience and the freedom he gave, which allow me to pursue my research interests thatsometimes are well outside the convex hull of the conventional information theory.

I thank Prof. Sanjeev Kulkarni and Prof. Paul Cuff for serving as my thesis readerand their valuable feedbacks. I thank Prof. Shlomo Shamai and Prof. Phillipe Rigolletfor being on my defense committee. I thank Prof. Mung Chiang and Prof. RobertCalderbank for serving as my oral general examiners. I am also grateful to Prof.Mung Chiang for being my first-year academic advisor and my second-year researchco-advisor.

I would like to thank the faculty of the Department of Electrical Engineering,Department of Mathematics and Department of Operational Research and FinancialEngineering for offering great curricula and numerous inspirations, especially, Prof.Robert Calderbank, Prof. Erhan Cinlar, Prof. Andreas Hamel, Prof. Elliot Lieb andDr. Tom Richardson, for their teaching.

I am grateful to Dr. Erik Ordentlich and Dr. Marcelo Weinberger for being mymentors during my internship at HP labs, where I spent a wonderful summer in2010. I would also like to acknowledge Prof. Andrew Barron, Prof. Dongning Guo,Dr. Jouni Luukkainen, Prof. Shlomo Shamai and Dr. Pablo Shmerkin for stimulatingdiscussions and fruitful collaborations.

I have been fortunate to have many friends and colleagues who made my fiveyears at Princeton an unforgettable experience: Zhenxing Wang, Yiyue Wu, ZhenXiang, Yuejie Chi, Chunxiao Li, Can Sun, Yu Yao, Lin Han, Chao Wang, YongxinXi, Guanchun Wang, Tiance Wang, Lorne Applebaum, Tiffany Tong, Jiming Yu,Qing Wang, Ying Li, Vaneet Aggarwal, Aman Jain, Victoria Kostina among them.In particular, I thank Yury Polyanskiy for numerous rewarding discussions, duringevery one of which I gained new knowledge. I thank Ankit Gupta for his philosophicalteaching and endless wisdom. I thank Haipeng Zheng for a decade of true friendship.My dearest friends outside Princeton, Yu Zhang, Bo Jiang, Yue Zhao, Sheng Zhou,are always there for me during both joyful and stressful times, to whom I will alwaysbe thankful.

Finally, it is my greatest honor to thank my family: my mother, my father, mygrandparents, my uncle and auntie. No words could possibly express my deepestgratitude for their love, self-sacrifice and unwavering support. To them I dedicatethis dissertation.

iv

Page 5: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

To my family.

v

Page 6: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Contents

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

1 Introduction 11.1 Compressed sensing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Phase transition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.3 Signal models: sparsity and beyond . . . . . . . . . . . . . . . . . . . 31.4 Three dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.5 Main contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Renyi information dimension 72.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.2 Characterizations and properties . . . . . . . . . . . . . . . . . . . . . 92.3 Evaluation of information dimension . . . . . . . . . . . . . . . . . . 122.4 Information dimension of order α . . . . . . . . . . . . . . . . . . . . 132.5 Interpretation of information dimension as entropy rate . . . . . . . . 152.6 Connections with rate-distortion theory . . . . . . . . . . . . . . . . . 162.7 High-SNR asymptotics of mutual information . . . . . . . . . . . . . 182.8 Self-similar distributions . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.8.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192.8.2 Information dimension . . . . . . . . . . . . . . . . . . . . . . 21

2.9 Information dimension under projection . . . . . . . . . . . . . . . . . 23

3 Minkowski dimension 253.1 Minkowski dimension of sets and measures . . . . . . . . . . . . . . . 253.2 Lossless Minkowski-dimension compression . . . . . . . . . . . . . . . 273.3 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4 MMSE dimension 404.1 Basic setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434.3 Various connections . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

4.3.1 Asymptotic statistics . . . . . . . . . . . . . . . . . . . . . . . 444.3.2 Optimal linear estimation . . . . . . . . . . . . . . . . . . . . 45

vi

Page 7: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

4.3.3 Asymptotics of Fisher information . . . . . . . . . . . . . . . . 464.3.4 Duality to low-SNR asymptotics . . . . . . . . . . . . . . . . . 47

4.4 Relationship between MMSE dimension and Information Dimension . 474.5 Evaluation of MMSE dimension . . . . . . . . . . . . . . . . . . . . . 49

4.5.1 Data processing lemma for MMSE dimension . . . . . . . . . 504.5.2 Discrete input . . . . . . . . . . . . . . . . . . . . . . . . . . . 504.5.3 Absolutely continuous input . . . . . . . . . . . . . . . . . . . 514.5.4 Mixed distribution . . . . . . . . . . . . . . . . . . . . . . . . 544.5.5 Singularly continuous distribution . . . . . . . . . . . . . . . . 56

4.6 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574.6.1 Approximation by discrete inputs . . . . . . . . . . . . . . . . 574.6.2 Self-similar input distribution . . . . . . . . . . . . . . . . . . 584.6.3 Non-Gaussian noise . . . . . . . . . . . . . . . . . . . . . . . . 58

4.7 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

5 Noiseless compressed sensing 635.1 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.1.1 Noiseless compressed sensing versus data compression . . . . . 635.1.2 Regularities of coding schemes . . . . . . . . . . . . . . . . . . 64

5.2 Definitions and Main results . . . . . . . . . . . . . . . . . . . . . . . 665.2.1 Lossless data compression . . . . . . . . . . . . . . . . . . . . 665.2.2 Lossless analog compression with regularity conditions . . . . 675.2.3 Coding theorems . . . . . . . . . . . . . . . . . . . . . . . . . 69

5.3 Lossless Linear Compression . . . . . . . . . . . . . . . . . . . . . . . 735.3.1 Minkowski-dimension compression and linear compression . . 735.3.2 Auxiliary results . . . . . . . . . . . . . . . . . . . . . . . . . 755.3.3 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

5.4 Lossless Lipschitz Decompression . . . . . . . . . . . . . . . . . . . . 835.4.1 Geometric measure theory . . . . . . . . . . . . . . . . . . . . 835.4.2 Converse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 855.4.3 Achievability for finite mixture . . . . . . . . . . . . . . . . . 865.4.4 Achievability for singular distributions . . . . . . . . . . . . . 89

5.5 Lossless linear compression with Lipschitz decompressors . . . . . . . 945.6 Lossless stable decompression . . . . . . . . . . . . . . . . . . . . . . 97

6 Noisy compressed sensing 1016.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1016.2 Distortion-rate tradeoff . . . . . . . . . . . . . . . . . . . . . . . . . . 102

6.2.1 Optimal encoder . . . . . . . . . . . . . . . . . . . . . . . . . 1026.2.2 Optimal linear encoder . . . . . . . . . . . . . . . . . . . . . . 1036.2.3 Random linear encoder . . . . . . . . . . . . . . . . . . . . . . 1036.2.4 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

6.3 Phase transition of noise sensitivity . . . . . . . . . . . . . . . . . . . 1076.4 Least-favorable input: Gaussian distribution . . . . . . . . . . . . . . 1086.5 Non-Gaussian inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

vii

Page 8: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

6.6 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

7 Suboptimality of practical reconstruction algorithms 1217.1 Noiseless compressed sensing . . . . . . . . . . . . . . . . . . . . . . . 1217.2 Noisy compressed sensing . . . . . . . . . . . . . . . . . . . . . . . . 123

8 Conclusions and future work 1268.1 Universality of sensing matrix ensemble . . . . . . . . . . . . . . . . . 1268.2 Near-optimal practical reconstruction algorithms . . . . . . . . . . . . 1278.3 Performance of optimal sensing matrices . . . . . . . . . . . . . . . . 1278.4 Compressive signal processing . . . . . . . . . . . . . . . . . . . . . . 128

A Proofs in Chapter 2 130A.1 The equivalence of (2.7) to (2.4) . . . . . . . . . . . . . . . . . . . . . 130A.2 Proofs of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 131A.3 Proofs of Lemmas 1 – 2 . . . . . . . . . . . . . . . . . . . . . . . . . 132A.4 Proof of Lemma 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133A.5 A non-asymptotic refinement of Theorem 9 . . . . . . . . . . . . . . . 134A.6 Proof of Lemma 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135A.7 Proof of Lemma 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136A.8 Proof of Theorem 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . 136A.9 Proof of Theorem 13 . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

B Proofs in Chapter 4 145B.1 Proof of Theorem 19 . . . . . . . . . . . . . . . . . . . . . . . . . . . 145B.2 Calculation of (4.17), (4.18) and (4.93) . . . . . . . . . . . . . . . . . 146B.3 Proof of Theorem 21 . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

B.3.1 Doob’s relative limit theorem . . . . . . . . . . . . . . . . . . 147B.3.2 Mixture of two mutually singular measures . . . . . . . . . . . 149B.3.3 Mixture of two absolutely continuous measures . . . . . . . . . 152B.3.4 Finite Mixture . . . . . . . . . . . . . . . . . . . . . . . . . . 154B.3.5 Countable Mixture . . . . . . . . . . . . . . . . . . . . . . . . 155

B.4 Proof of Theorem 22 . . . . . . . . . . . . . . . . . . . . . . . . . . . 156B.5 Proof of Theorems 23 – 25 . . . . . . . . . . . . . . . . . . . . . . . . 158

B.5.1 Proof of Theorem 24 . . . . . . . . . . . . . . . . . . . . . . . 158B.5.2 Proof of Theorem 23 . . . . . . . . . . . . . . . . . . . . . . . 160B.5.3 Proof of Theorem 25 . . . . . . . . . . . . . . . . . . . . . . . 161

B.6 Proof of Theorem 28 . . . . . . . . . . . . . . . . . . . . . . . . . . . 163B.7 Proof for Remark 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

C Proofs in Chapter 5 169C.1 Injectivity of the cosine matrix . . . . . . . . . . . . . . . . . . . . . . 169C.2 Proof of Lemma 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169C.3 Proof of Lemma 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170C.4 Proof of Lemma 19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170

viii

Page 9: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

C.5 Proof of Lemma 21 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171C.6 Proof of Lemma 26 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172

D Distortion-rate tradeoff of Gaussian inputs 175D.1 Optimal encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175D.2 Optimal linear encoder . . . . . . . . . . . . . . . . . . . . . . . . . . 175D.3 Random linear encoder . . . . . . . . . . . . . . . . . . . . . . . . . . 177

Bibliography 179

ix

Page 10: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

List of Tables

5.1 Regularity conditions of encoder/decoders and corresponding mini-mum ε-achievable rates. . . . . . . . . . . . . . . . . . . . . . . . . . 68

x

Page 11: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

List of Figures

1.1 Compressed sensing: an abstract setup. . . . . . . . . . . . . . . . . . 2

4.1 Plot of snr mmse(X, snr) against log3 snr, where X has a ternary Cantordistribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.2 Plot of snr · mmse(X,N, snr)/varN against log3 snr, where X has aternary Cantor distribution and N is uniformly distributed in [0, 1]. . 60

4.3 Plot of mmse(X,N, snr) against snr, where X is standard Gaussian andN is standard Gaussian, equiprobable Bernoulli or Cantor distributed(normalized to have unit variance). . . . . . . . . . . . . . . . . . . . 62

6.1 Noisy compressed sensing setup. . . . . . . . . . . . . . . . . . . . . . 1016.2 For sufficiently small σ2, there are arbitrarily many solutions to (6.16). 1056.3 D∗(XG, R, σ

2), D∗L(XG, R, σ2), DL(XG, R, σ

2) against snr = σ−2. . . . . 1096.4 D∗(XG, R, σ

2), D∗L(XG, R, σ2), DL(XG, R, σ

2) against R when σ2 = 1. . 1096.5 Worst-case noise sensitivity ζ∗, ζ∗L and ζL for the Gaussian input, which

all become infinity when R < 1 (the unstable regime). . . . . . . . . . 111

7.1 Comparison of suboptimal thresholds (7.4) – (7.6) with optimal thresh-old. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

7.2 Phase transition of the asymptotic noise sensitivity: sparse signalmodel (1.2) with γ = 0.1. . . . . . . . . . . . . . . . . . . . . . . . . . 125

C.1 A schematic illustration of τ in terms of M -ary expansions. . . . . . . 172

xi

Page 12: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Chapter 1

Introduction

1.1 Compressed sensing

Current data acquisition methods are often extremely wasteful: massive amountsof data are acquired then discarded by a subsequent compression stage. In contrast, arecent approach known as compressed sensing [1, 2], combines the data collection andcompression and reduces enormously the volume of data at much greater efficiencyand lower costs. A paradigm of encoding of analog sources by real numbers ratherthan bits, compressed sensing deals with efficient recovery of sparse vectors from theinformation provided by linear measurements. By leveraging the prior knowledge ofthe signal structure (e.g., sparsity) and designing efficient non-linear reconstructionalgorithms, effective compression is achieved by taking a much smaller number ofmeasurements than the dimension of the original signal.

An abstract setup of compressed sensing is shown in Fig. 1.1: A real vector xn ∈ Rn

is mapped into yk ∈ Rk by an encoder (or compressor) f : Rn → Rk. The decoder (ordecompressor) g : Rk → Rn receives yk, a possibly noisy version of the measurement,and outputs xn as the reconstruction. The measurement rate, i.e., the number ofmeasurements per sample or the dimensionality compression ratio, is given by

R =k

n. (1.1)

The goal is to reconstruct the unknown vector xn as faithfully as possible. In theclassical setup of compressed sensing, the encoder f is constrained to be a linearmapping characterized by a k×n matrix A, called the sensing or measurement matrix.Furthermore, A is usually assumed to be a random matrix, for example, consisting ofi.i.d. entries or randomly sampled Fourier coefficients. In order to guarantee stablereconstruction in the presence of measurement inaccuracy, we usually also requirethe decoder to be robust, i.e., Lipschitz continuous. The fundamental question is:under practical constraints such as linearity of the encoder and/or robustness of thedecoder, what is the minimal measurement rate so that the input vector can beaccurately reconstructed?

1

Page 13: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Encoderf : Rn→Rk +

ek

Decoderg: Rk→Rn

xn yk yk xn

Figure 1.1: Compressed sensing: an abstract setup.

As opposed to the conventional worst-case (Hamming) approach, in this thesis weintroduce a statistical (Shannon) framework for compressed sensing, where signalsare modeled as random processes rather than individual sequences and performanceis measured on an average basis. This setup can also be understood as a Bayesianformulation of compressed sensing where the input distribution serves as the prior ofthe signal. Our goal is to seek the fundamental limit of the measurement rate as afunction of the input statistics. Similar approaches have been previously adopted inthe literature, for example, [3, 4, 5, 6, 7, 8, 9, 10, 11].

In addition to linear encoders, we are also concerned with non-linear encodersmainly for the following reasons. First, studying the fundamental limits of nonlinearencoders provides converse results (lower bound on the measurement rate) for linearencoding. Second, it enables us to gauge the suboptimality incurred by linearityconstraints on the encoding procedures.

1.2 Phase transition

When the measurements are noiseless, our goal is to reconstruct the original sig-nal as perfectly as possible by driving the error probability to zero as the ambientdimension n grows. For many input processes (e.g., memoryless), it turns out thatthere exists a threshold for the measurement rate, above which it is possible to achievea vanishing error probability and below which the error probability will eventuallyapproaches one for any sequence of encoder-decoder pairs. Such a phenomenon isknown as phase transition in statistical physics. In information-theoretic terms, wesay that strong converse holds.

When the measurement is noisy, exact signal recovery is obviously impossible.Instead, our objective is to pursue robust reconstruction. To this end, we consideranother benchmark called the noise sensitivity, defined as the ratio between mean-square reconstruction error and the noise variance. Similar to the behavior of errorprobability in the noiseless case, there exists a phase transition threshold of measure-ment rate which only depends on the input statistics, above which noise sensitivityis bounded for all noise variance, and below which the noise sensitivity blows up asnoise variance tends to zero.

Clearly, to every sequence of encoders and decoders we might associate a phasetransition threshold. In our paper we will be primarily concerned with optimal de-coding procedures, as opposed to pragmatic low-complexity algorithms. For instance,the optimal decoder in the noisy case is the minimum mean-square error (MMSE)

2

Page 14: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

estimator, i.e., the conditional expectation of the input vector given the noisy mea-surements. In view of the prohibitive computational complexity of optimal decoders,various practical suboptimal reconstruction algorithms have been proposed, most no-tably, decoders based on convex optimizations such as `1-minimization (i.e., BasisPursuit) [12] and `1-penalized least-squares (i.e. LASSO) [13], greedy algorithmssuch as matching pursuit [14], graph-based iterative decoders such as approximatemessage passing (AMP) [4], etc. These algorithms in general have strictly higherphase-transition thresholds. In addition to optimal decoders, we also consider thecase where we have the freedom to optimize over measurement matrices.

In the case of noiseless compressed sensing, the interesting regime of measurementrates is between zero and one. When the measurements are noisy, in principle it makessense to consider measurement rate greater than one in order to combat the noise.Nevertheless, the optimal phase transition for noise sensitivity is always less thanone, because with k = n and an invertible measurement matrix, the linear MMSEestimator achieves bounded noise sensitivity for any noise variance.

1.3 Signal models: sparsity and beyond

Sparse vectors, supported on a subspace with dimension much smaller than theambient dimension, are of great significance in signal processing and statistical mod-eling. One suitable probabilistic model for sparse signals is the following mixturedistribution [5, 15, 4, 7]:

P = (1− γ)δ0 + γ Pc, (1.2)

where δ0 denotes the Dirac measure at 0, Pc is a probability measure absolutelycontinuous with respect to the Lebesgue measure, and 0 ≤ γ ≤ 1. Consider a randomvector Xn = [X1, . . . , Xn]T independently drawn from P . By the weak law of largenumbers, if γ < 1, Xn is sparse with overwhelming probability. More precisely,1n‖Xn‖0

P−→γ, where the “`0 norm”1 ‖xn‖0 =∑n

i=1 1|xi|>0 denotes the support sizeof a vector. This corresponds to the regime of proportional (or linear) sparsity. In(1.2), the weight on the continuous part γ parametrizes the signal sparsity: less γcorresponds to more sparsity, while Pc serves as the prior distribution of non-zeroentries.

Apart from sparsity, there are other signal structures that have been previouslyexplored in the compressed sensing literature [15, 16]. For example, the so-calledsimple signal in infrared absorption spectroscopy [16, Example 3, p. 914] is such thateach entry of of the signal vector is constrained in the unit interval, with most ofthe entries saturated at the boundaries (0 or 1). Similar to the rationale that leadsto (1.2), an appropriate statistical model for simple signals is a memoryless inputprocess drawn from

P = (1− γ)(pδ0 + (1− p)δ1

)+ γPc, (1.3)

1Strictly speaking, ‖·‖0 is not a norm because ‖λxn‖0 6= |λ| ‖xn‖0 for λ 6= 0 or ±1. However,‖x− y‖0 gives a valid metric on Rn.

3

Page 15: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

where Pc is an absolutely continuous probability measure supported on the unit in-terval and 0 ≤ p ≤ 1.

Generalizing both (1.2) and (1.3), we consider discrete-continuous mixed distri-butions (i.e., elementary distributions in the parlance of [17]) of the following form:

P = (1− γ)Pd + γPc, (1.4)

where Pd and Pc are discrete and absolutely continuous respectively. Although mostof our results in this thesis hold for arbitrary input distributions, we will be focusingon discrete-continuous mixtures because of its relevance to the compressed sensingapplications.

1.4 Three dimensions

There are three dimension concepts that are involved in various coding theoremsthroughout this thesis, namely the information dimension, the Minkowski dimension,and the MMSE dimension, summarized in Chapters 2, 3 and 4, respectively. Theyall measure how fractal subsets or probability measures are in Euclidean spaces incertain senses.

In fractal geometry, the Minkowski dimension (also known as the box-countingdimension) [18] gauges the fractality of a subset in metric spaces, defined as theexponent with which the covering number grows. The relevance of Minkowski dimen-sion to compressed sensing are twofold: First, subsets of low Minkowski dimensioncan be linearly embedded into low-dimensional Euclidean spaces [19, 20, 21].2 Thisresult is further improved and extended to a probabilistic setup in Chapter 5. Sec-ond, Minkowski dimension never increases under Lipschitz mappings. Consequently,knowledge about Minkowski dimension provides converse results for analog compres-sion when the decompressor are constrained to be Lipschitz.

Another key concept in fractal geometry is the information dimension (also knownas the entropy dimension [23]) defined by Alfred Renyi in 1959 [24]. It measures therate of growth of the entropy of successively finer discretizations of a probability dis-tribution. Although a fundamental information measure, the Renyi dimension is farless well-known than either the Shannon entropy or the Renyi entropy. Renyi showedthat, under certain conditions, the information dimension of an absolutely continuousn-dimensional probability distribution is n. Hence he remarked in [24] that “... thegeometrical (or topological) and information-theoretical concepts of dimension coin-cide for absolutely continuous probability distributions”. However, as far as we know,the operational role of Renyi information dimension has not been addressed beforeexcept in the work of Kawabata and Dembo [25] and Guionnet and Shlyakhtenko [26],which relates it to the rate-distortion function and mutual information, respectively.In this thesis we provide several new operational characterizations of information di-mension in the context of lossless compression of analog sources. In Chapter 3 weshow that information dimension is connected to how product measures are concen-

2This connection has also been recognized in [22].

4

Page 16: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

trated on subsets of low Minkowski dimension. Coupled with the linear embeddabilityresult mentioned above, this connection suggests the almost lossless compressibilityof memoryless sources via linear encoders, which makes noiseless compressed sensingpossible. Moreover, Chapter 5 relates information dimension to the fundamental limitof almost lossless compression of analog sources under various regularity constraintsof the encoder/decoder.

The MMSE dimension is an information measure for probability distributions thatgoverns the asymptotics of the minimum mean-square error (MMSE) of estimatinga random variable corrupted by additive Gaussian noise when the signal to noiseratio (SNR) or the sample size is large. This concept is closely related to informationdimension, and will be useful when we discuss noisy compressed sensing in Chapter 6.

1.5 Main contributions

Compressed sensing, as an analog compression paradigm, imposes two basic re-quirements: the linearity of the encoder and the robustness of the decoder; the ra-tionale is that low complexity of encoding operations and noise resilience of decodingoperations are indispensable in dealing with analog sources. To better understandthe fundamental limits imposed by the requirements of low complexity and noise re-silience, it is pedagogically sound to study them both separately and jointly as wellas in a more general framework than compressed sensing.

Motivated by this observation, in Chapter 5 we introduce the framework of al-most lossless analog compression as a Shannon-theoretic formulation of noiseless com-pressed sensing. Abstractly, the approach boils down to probabilistic dimension re-duction with smooth embedding. Under regularity conditions on the encoder orthe decoder, various coding theorems for the minimal measurement rate are derivedinvolving the information dimension of the input distribution. The most interestingregularity constraints are the linearity of the compressor and Lipschitz continuity (ro-bustness) of the decompressor. The fundamental limits when these two constraintsare imposed either separately or simultaneously are all considered. For memorylessdiscrete-continuous mixtures that are of the form (1.4), we show that the minimalmeasurement rate is given by the input information dimension, which coincides withthe weight γ of the absolutely continuous part. Moreover, the Lipschitz constant ofthe decoder can be chosen independent of n, as a function of the gap between themeasurement rate and γ. This resolves the optimal phase transition threshold of errorprobability in noiseless compressed sensing.

Chapter 6 deals with the noisy case where the measurements are corrupted byadditive noise. We consider three formulations of noise sensitivity, correspondingto optimal nonlinear, optimal linear and random linear (with i.i.d. entries) encodersand the associated optimal decoders, respectively. In the case of memoryless inputprocesses, we show that for any input distribution, the phase transition threshold foroptimal encoding is given by the input information dimension. Moreover, this resultalso holds for discrete-continuous mixtures with optimal linear encoders and Gaussianrandom linear encoders. Invoking the results on MMSE dimension in Chapter 4,

5

Page 17: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

we show that the calculation of the reconstruction error with random measurementmatrices based on heuristic replica methods in [7] predicts the correct phase transitionthreshold. These results also serve as a rigorous verification of the replica calculationsin [7] in the high-SNR regime (up to o(σ2) as the noise variance σ2 vanishes).

In Chapter 7, we compare the optimal phase transition threshold to the subop-timal threshold of several practical reconstruction algorithms under various inputdistributions. In particular, we show that the thresholds achieved by decoders basedon linear-programming [16] and the AMP decoder [15, 6] lie far from the optimalboundary, especially in the highly sparse regime that is most relevant to compressedsensing applications.

6

Page 18: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Chapter 2

Renyi information dimension

This chapter is devoted to an overview of Renyi information dimension and itsproperties. In particular, in Section 2.5 we give a novel interpretation in terms ofthe entropy rate of the dyadic expansion of the random variable. We also discuss theconnection between information dimension and rate-distortion theory established byKawabata and Dembo [25], as well as the high-SNR asymptotics of mutual informa-tion with additive noise due to Guionnet and Shlyakhtenko [26]. The material in thischapter has been presented in part in [27, 5, 28, 29]; Theorems 1 – 3 and the contentin Section 2.9 are new.

2.1 Definitions

A key concept in fractal geometry, in [24] Renyi defined the information dimension(also known as the entropy dimension [23]) of a probability distribution. It measuresthe rate of growth of the entropy of successively finer discretizations.

Definition 1 (Information dimension). Let X be an arbitrary real-valued randomvariable. For m ∈ N, denote the uniformly quantized version of X by1

〈X〉m ,bmXcm

. (2.1)

Define

d(X) = lim infm→∞

H (〈X〉m)

logm(2.2)

and

d(X) = lim supm→∞

H (〈X〉m)

logm, (2.3)

where d(X) and d(X) are called lower and upper information dimensions of X re-spectively. If d(X) = d(X), the common value is called the information dimension of

1Throughout this thesis, 〈·〉m denotes the quantization operator, which can also be applied toreal numbers, vectors or subsets of Rn componentwise.

7

Page 19: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

X, denoted by d(X), i.e.

d(X) = limm→∞

H (〈X〉m)

logm(2.4)

Renyi also defined the “entropy of dimension d(X)” as

H(X) = limm→∞

(H(〈X〉m)− d(X) logm) , (2.5)

provided the limit exists.

Definition 1 can be readily extended to random vectors, where the floor functionb·c is taken componentwise [24, p. 208]. Since d(X) only depends on the distributionof X, we also denote d(PX) = d(X). Similar convention also applies to entropy andother information measures.

Apart from discretization, information dimension can be defined from a moregeneral viewpoint: the mesh cubes of size ε > 0 in Rk are the sets Cz,ε =

∏kj=1[zjε, (zj+

1)ε) for zk ∈ Zk. For any ε > 0, the collection Cz,ε : z ∈ Zk partitions Rk. Hencefor any probability measure µ on Rk, this partition generates a discrete probabilitymeasure µε on Zk by assigning µε(z) = µ(Cz,ε). Then the information dimensionof µ can be expressed as

d(X) = limε↓0

H (µε)

log 1ε

. (2.6)

It should be noted that there exist alternative definitions of information dimensionin the literature. For example, in [30] the lower and upper information dimensionsare defined by replacing H (µε) with the ε-entropy Hε (µ) [31, 32] with respect to the`∞ distance. This definition essentially allows unequal partition of the whole space,since Hε (µ) ≤ H (µε). However, the resulting definition is equivalent (see Theorem8). As another example, the following definition is adopted in [33, Definition 4.2],

d(X) = limε↓0

E [log µ(Bp(X, ε))]

log ε(2.7)

= limε↓0

E[logP

X ∈ Bp(X, ε)|X

]log ε

, (2.8)

where µ denotes the distribution of X, X is an independent copy of X and Bp(x, ε)is the `p-ball2 of radius ε centered at x, with the `p norm on Rn defined as

‖xn‖p =

(∑n

i=1 |xi|p)1p , p ∈ [1,∞).

max|x1|, . . . , |xn|, p =∞.(2.9)

2By the equivalence of norms on finite-dimensional space, the the value of the limit on the right-hand side of (2.7) is independent of the underlying norm.

8

Page 20: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Note that using (2.7), information dimension can be defined for random variablesvalued on an arbitrary metric space. The equivalence of (2.7) to (2.4) is shown inAppendix A.1.

2.2 Characterizations and properties

The lower and upper information dimension of a random variable might not alwaysbe finite, because H(〈X〉m) can be infinity for all m. However, as pointed out in [24],if the mild condition

H(bXnc) <∞ (2.10)

is satisfied, we have0 ≤ d(Xn) ≤ d(Xn) ≤ n. (2.11)

The necessity of (2.10) and its equivalence to finite mutual information with additivenoise is shown next in Theorem 1. In the scalar case, a sufficient condition for finiteinformation dimension is the existence of the Shannon transform [34, Definition 2.12],i.e.,

E [log(1 + |X|)] <∞. (2.12)

See Appendix A.2 for a proof. Consequently, if E [|X|ε] < ∞ for some ε > 0, thend(X) <∞.

Theorem 1. Let NnG be standard normal distributed independent of Xn. Define

I(Xn, snr) = I(Xn;√snrXn +Nn

G). (2.13)

Then the following are equivalent:

1) 0 ≤ d(Xn) ≤ d(Xn) ≤ n;

2) H(bXnc) <∞;

3) E[log 1

PXn (Bp(Xn,ε))

]<∞ for some ε > 0;

4) I(Xn, snr) <∞ for some snr > 0;

5) I(Xn, snr) <∞ for all snr > 0.

If H(bXnc) =∞, thend(Xn) =∞. (2.14)

Proof. See Appendix A.2.

To calculate the information dimension in (2.2) and (2.3), it is sufficient to restrictto the exponential subsequence m = 2l, as a result of the following proposition.

9

Page 21: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Lemma 1. Denote[·]l , 〈·〉2l . (2.15)

Then

d(X) = lim infl→∞

H([X]l)

l, (2.16)

d(X) = lim supl→∞

H([X]l)

l. (2.17)

Proof. See Appendix A.3.

By analogous arguments that lead to Theorem 1, we have

Lemma 2. d(X) and d(X) are unchanged if rounding or ceiling functions are usedin Definition 1.

Proof. See Appendix A.3.

The following lemma states additional properties of information dimension, thelast three of which are inherited from Shannon entropy. For simplicity we assumethat all information dimensions involved exist and are finite.

Lemma 3.

• Translation-invariance: for all xn ∈ Rn,

d(xn +Xn) = d(Xn). (2.18)

• Scale-invariance: for all α 6= 0,

d(αXn) = d(Xn). (2.19)

• If Xn and Y n are independent, then3

maxd(Xn), d(Y n) ≤ d(Xn + Y n) (2.20)

≤ d(Xn) + d(Y n). (2.21)

• If Xi are independent and d(Xi) exists for all i, then

d(Xn) =n∑i=1

d(Xi). (2.22)

• If Xn, Y n, Zn are independent, then

d(Xn + Y n + Zn) + d(Zn) ≤ d(Xn + Zn) + d(Y n + Zn). (2.23)

3The upper bound (2.21) can be improved by defining conditional information dimension.

10

Page 22: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Proof. Appendix A.4.

The scale-invariance property of information dimension in (2.19) can in fact begeneralized to invariance under (the pushforward of) all bi-Lipschitz functions, aproperty shared by Minkowski and packing dimension [35, Exercise 7.6]. The proofis straightforward in view of the alternative definition of information dimension in(2.7).

Theorem 2. For any Lipschitz function f : R→ R, the (lower and upper) informa-tion dimension of X do not exceed those of f(X). Consequently, for any bi-Lipschitzfunction f : R→ R, the (lower and upper) information dimension of X coincide withthose of f(X).

To conclude this section, we discuss the regularity of information dimension as afunctional of probability distributions. Similar to the Shannon entropy, informationdimension is also discontinuous with respect to the weak-* topology. This can beverified using Theorem 4 presented in the next section, which shows that the infor-mation dimension of discrete and absolutely continuous distributions are zero and onerespectively. However, the next result shows that information dimension is Lipschitzcontinuous with respect to the total variation distance.

Theorem 3. Let Xn and Y n be random vectors such that H(bXnc) and H(bY nc)are both finite. Then

max|d(Xn)− d(Y n)|, |d(Xn)− d(Y n)| ≤ nTV(PXn , PY n). (2.24)

Proof. Let ε = TV(PXn , PY n). Since dropping the integer part does not change theinformation dimension, we shall assume that Xn and Y n are both valued in the unithypercube [0, 1]n. Let m ∈ N. Since the total variation distance, as an f -divergence,satisfies the data processing inequality [36], we have TV(P[Xn]m , P[Y n]m) ≤ ε. By thevariational characterization of the total variation distance [37], we have

TV(P[Xn]m , P[Y n]m) = minP[Xn]m[Y n]m

P [Xn]m 6= [Y n]m, (2.25)

where the minimum is taken with respect to all joint distributions with the pre-scribed marginals P[Xn]m and P[Y n]m . Now let [Xn]m and [Y n]m be jointly distributedaccording to the minimizer of (2.25). Then

H([Xn]m) ≤ H([Xn]m|[Y n]m) +H([Y n]m) (2.26)

≤ H([Xn]m|[Y n]m, [Xn]m 6= [Y n]m)ε+H([Y n]m) (2.27)

≤ nmε+H([Y n]m) (2.28)

where (2.27) and (2.28) are due to (2.25) and that Xn ∈ [0, 1]n respectively. Dividingboth sides of (2.28) by m and taking lim inf and lim sup with respect to m→∞, weobtain (2.24).

11

Page 23: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

2.3 Evaluation of information dimension

By the Lebesgue Decomposition Theorem [38], a probability distribution can beuniquely represented as the mixture

ν = pνd + qνc + rνs, (2.29)

where

• p+ q + r = 1, p, q, r ≥ 0.• νd is a purely atomic probability measure (discrete part).• νc is a probability measure absolutely continuous with respect to Lebesgue

measure, i.e., having a probability density function (continuous part4).• νs is a probability measure singular with respect to Lebesgue measure but with

no atoms (singular part).

As shown in [24], the information dimension for the mixture of discrete and abso-lutely continuous distribution can be determined as follows (see [24, Theorem 1 and3] or [17, Theorem 1, pp. 588 – 592]):

Theorem 4. Let X be a random variable such that H(bXc) < ∞. Assume thedistribution of X can be represented as

ν = (1− ρ)νd + ρνc, (2.30)

where νd is a discrete measure, νc is an absolutely continuous measure and 0 ≤ ρ ≤ 1.Then

d(X) = ρ. (2.31)

Furthermore, given the finiteness of H(νd) and h(νc), H(X) admits a simple formula

H(X) = (1− ρ)H(νd) + ρh(νc) + h(ρ), (2.32)

where H(νd) is the Shannon entropy of νd, h(νc) is the differential entropy of νc, andh(ρ) , ρ log 1

ρ+ (1− ρ) log 1

1−ρ is the binary entropy function.

Some consequences of Theorem 4 are as follows: as long as H(bXc) <∞,

1. X is discrete: d(X) = 0, and H(X) coincides with the Shannon entropy of X.

2. X is continuous: d(X) = 1, and H(X) is equal to the differential entropy of X.

3. X is discrete-continuous-mixed: d(X) = ρ, and H(X) is the weighted sum ofthe entropy of discrete and continuous parts plus a term of h2(ρ).

For mixtures of countably many distributions, we have the following theorem.

4In measure theory, sometimes a measure is called continuous if it does not have any atoms, anda singular measure is called singularly continuous. Here we say a measure continuous if and only ifit is absolutely continuous.

12

Page 24: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Theorem 5. Let Y be a discrete random variable with H(Y ) < ∞. If d(PX|Y=i)exists for all i, then d(X) exists and is given by d(X) =

∑∞i=1 PY (i)d(PX|Y=i). More

generally,

d(X) ≤∞∑i=1

PY (i)d(PX|Y=i), (2.33)

d(X) ≥∞∑i=1

PY (i)d(PX|Y=i). (2.34)

Proof. For any m, the conditional distribution of 〈X〉m given Y = i is the same as⟨X(i)

⟩m

. Then

H(〈X〉m|Y ) ≤ H(〈X〉m) ≤ H(〈X〉m|Y ) +H(Y ), (2.35)

where

H (〈X〉m|Y ) =∞∑i=1

PY (i)H (〈X〉m|Y = i) . (2.36)

Since H(Y ) < ∞, dividing both sides of (2.35) by logm and sending m → ∞ yields(2.33) and (2.34).

To summarize, when X has a discrete-continuous mixed distribution, the infor-mation dimension of X is given by the weight of the continuous part. When thedistribution of X has a singular component, its information dimension does not ad-mit a simple formula in general. For instance, it is possible that d(X) < d(X) (see[24, p. 195] or the example at the end of Section 2.5). However, for the importantclass of self-similar singular distributions, the information dimension can be explicitlydetermined. See Section 2.8.

2.4 Information dimension of order α

With Shannon entropy replaced by Renyi entropy in (2.2) – (2.3), the generalizednotion of information dimension of order α can be defined similarly as follows [39, 17]:5

Definition 2 (Information dimension of order α). Let α ∈ [0,∞]. The lower andupper dimensions of X of order α are defined as

dα(X) = lim infm→∞

Hα (〈X〉m)

logm(2.37)

and

dα(X) = lim supm→∞

Hα (〈X〉m)

logm(2.38)

5dα is also known as the Lα dimension or spectrum [23, 33]. In particular, d2 is called thecorrelation dimension.

13

Page 25: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

respectively, where Hα(Y ) denotes the Renyi entropy of order α of a discrete randomvariable Y with probability mass function py : y ∈ Y, defined as

Hα(Y ) =

y∈X py log 1py, α = 1,

log 1maxy∈Y

py, α =∞,

11−α log

(∑y∈Y p

αy

), α 6= 1,∞.

(2.39)

If dα(X) = dα(X), the common value is called the information dimension of X oforder α, denoted by dα(X). The “the entropy of X of order α and dimension dα(X)”as

Hα(X) = limm→∞

(Hα(〈X〉m)− dα(X) logm) , (2.40)

provided the limit exists.

Analogous to (2.7), dα has the following “continuous” version of definition that isequivalent to Definition 2 (see, for example, [21, Section 2.1] or [40, Definition 2.6]):

dα(X) = limε↓0

logE [µ(Bp(X, ε))α−1]

(α− 1) log ε. (2.41)

As a consequence of the monotonicity of Renyi entropy, information dimensionsof different orders satisfy the following result.

Lemma 4. For α ∈ [0,∞], dα and dα both decrease with α. Define

d(X) = limα↑1

dα(X). (2.42)

Then0 ≤ d∞ ≤ lim

α↓1dα ≤ d ≤ d ≤ d ≤ d0. (2.43)

Proof. All inequalities follow from the fact that for a fixed random variable Y , Hα(Y )decreases with α in [0,∞].

For dimension of order α we highlight the following result from [39].

Theorem 6 ([39, Theorem 3]). Let X be a random variable whose distribution hasLebesgue decomposition as in (2.29). Assume that Hα(bXc) <∞. Then

1. For α > 1: if p > 0, that is, X has a discrete component, then dα(X) = 0 andHα(X) = Hα(νd) + α

1−α log p.

2. For α < 1: if q > 0, that is, X has an absolutely continuous component, thendα(X) = 1 and Hα(X) = hα(νc) + α

1−α log q, where hα(νc) is the differentialRenyi entropy of νc, defined in terms of its density fc as

hα(νc) =1

1− αlog

∫fαc (x)dx. (2.44)

14

Page 26: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

As a consequence of the monotonicity, α 7→ dα(X) can have at most countablyinfinite discontinuities. Remarkably, the only possible discontinuity is at α = 1, as aresult of the following inequality due to Beck [41]:

α′ − 1

α′dα′(X) ≥ α− 1

αdα(X), α < α′, α, α′ 6= 0, 1. (2.45)

Recall that d1(X) = d(X). According to Theorem 6. for discrete-continuous mixeddistributions, dα(X) = d(X) = 1 for all α < 1, while d(X) equals to the weight of thecontinuous part. However, for Cantor distribution (see Section 2.8), dα(X) = d(X) =d(X) = log3 2 for all α.

2.5 Interpretation of information dimension as en-

tropy rate

Let (x)i denote the ith bit in the binary expansion of 0 ≤ x < 1, that is,

(x)i = 2i([x]i − [x]i−1) =⌊2ix⌋− 2⌊2i−1x

⌋∈ Z2. (2.46)

Let X ∈ [0, 1] a.s. Observe that H(〈X〉m) ≤ logm, since the range of 〈X〉m containsat most m values. Then 0 ≤ d(X) ≤ d(X) ≤ 1. The dyadic expansion of X can bewritten as

X =∞∑j=1

(X)j2−j, (2.47)

where each (X)j is a binary random variable. Therefore there is a one-to-one corre-spondence between X and the binary random process (X)j : j ∈ N. Note that thepartial sum in (2.47) is

[X]i =i∑

j=1

(X)j2−j, (2.48)

and [X]i and ((X)1, . . . , (X)i) are in one-to-one correspondence, therefore

H([X]i) = H ((X)1, . . . , (X)i) . (2.49)

By Lemma 1, we have

d(X) = lim infi→∞

H ((X)1, . . . , (X)i)

i, (2.50)

d(X) = lim supi→∞

H ((X)1, . . . , (X)i)

i. (2.51)

Thus the information dimension of X is precisely the entropy rate of its dyadic ex-pansion, or the entropy rate of any M -ary expansion of X, divided by logM .

15

Page 27: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

This interpretation of information dimension enables us to gain more intuitionabout the result in Theorem 4. When X has a discrete distribution, its dyadicexpansion has zero entropy rate. When X is uniform on [0, 1], its dyadic expansion isi.i.d. equiprobable, and therefore it has unit entropy rate in bits. If X is continuous,but nonuniform, its dyadic expansion still has unit entropy rate. Moreover, from(2.50), (2.51) and Theorem 4 we have

limi→∞

D ((X)1, . . . , (X)i || equiprobable) = −h(X), (2.52)

where D(· || ·) denotes the relative entropy and the differential entropy is h(X) ≤ 0since X ∈ [0, 1] a.s. The information dimension of a discrete-continuous mixture isalso easily understood from this point of view, because the entropy rate of a mixedprocess is the weighted sum of entropy rates of each component. Moreover, ran-dom variables whose lower and upper information dimensions differ can be easilyconstructed from processes with different lower and upper entropy rates.

2.6 Connections with rate-distortion theory

The asymptotic behavior of the rate-distortion function, in particular, the asymp-totic tightness of the Shannon lower bound in the high-rate regime, has been ad-dressed in [42, 43] for continuous sources. In [25] Kawabata and Dembo generalizedit to real-valued sources that do not necessarily possess a density, and showed thatthe information dimension plays a central role.

The asymptotic tightness of the Shannon lower bound in the high-rate regime isshown by the following result.

Theorem 7 ([43]). Let X be a random variable on the normed space (Rk, ‖·‖) with adensity such that h(X) > −∞. Let the distortion function be ρ(x, y) = ‖x− y‖r withr > 0, and the single-letter rate-distortion function is given by

RX(D) = infE‖Y−X‖r≤D

I(X;Y ). (2.53)

Suppose that there exists an α > 0 such that E ‖X‖α < ∞. Then RX(D) ≥ RL(D)and

limD→0

(RX(D)−RL(D)) = 0, (2.54)

where the Shannon lower bound RL(D) takes the following form

RL(D) =k

rlog

1

D+ h(X) +

k

rlog

k

re− k

(k

r

)log Vk, (2.55)

where Vk denotes the volume of the unit ball Bk = x ∈ Rk : ‖x‖ ≤ 1.

Special cases of Theorem 7 include:

16

Page 28: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

• MSE distortion and scalar source: ρ(x, y) = (x − y)2, ‖x‖ = |x|, r = 2, k = 1and Vk = 2. Then the Shannon lower bound takes the familiar form

RL(D) = h(X)− 1

2log(2πeD). (2.56)

• Absolute distortion and scalar source: ρ(x, y) = |x− y|, ‖x‖ = |x|, r = 1, k = 1and Vk = 2. Then

RL(D) = h(X)− log(2eD). (2.57)

• `∞ norm and vector source: ρ(x, y) = ‖x− y‖∞, ‖x‖ = ‖x‖∞, r = 1 andVk = 2k. Then

RL(D) = h(X) + log

(1

k!

(k

2eD

)k). (2.58)

For general sources, Kawabata and Dembo introduced the concept of rate-distortion dimension in [25]. The rate-distortion dimension of a measure µ (or arandom variable X with the distribution µ) on the metric space (Rk, d) is defined asfollows:

dimR(X) = lim supD↓0

RX(D)1r

log 1D

, (2.59)

dimR(X) = lim infD↓0

RX(D)1r

log 1D

, (2.60)

where RX(D) is the rate-distortion function of X with distortion function ρ(x, y) =d(x, y)r and r > 0. Then under the equivalence of the metric and the `∞ norm as in(2.61), the rate-distortion dimension coincides with the information dimension of X:

Theorem 8 ([25, Proposition 3.3]). Consider the metric space (Rk, d). If there existsa1, a2 > 0, such that for all x, y ∈ Rk,

a2 ‖x− y‖∞ ≤ d(x, y) ≤ a1 ‖x− y‖∞ , (2.61)

then

dimR(X) = d(X), (2.62)

dimR(X) = d(X), (2.63)

Moreover, (2.62) and (2.63) hold even if the ε-entropy Hε(X) instead of the rate-distortion function RX(ε) is used in the definition of dimR(X) and dimR(X).

17

Page 29: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

In particular, consider the special case of scalar source and MSE distortiond(x, y) = |x− y|2. Then whenever d(X) exists and is finite,6

RX(D) =d(X)

2log

1

D+ o (logD), D → 0. (2.64)

Therefore d(X)2

is the scaling constant of RX(D) with respect to log 1D

in the high-rateregime. This fact gives an operational characterization of information dimension inShannon theory. Note that in the most familiar cases we can sharpen (2.64) to showthe following: as D → 0,

• X is discrete and H(X) <∞: RX(D) = H(X) + o (1).• X is continuous and h(X) > −∞: RX(D) = 1

2log 1

2πeD+ h(X) + o (1).

• X is discrete-continuous mixed: if PX is given by the mixture (2.30) with finitevariance, then [42, Theorem 1]

RX(D) =ρ

2log

1

D+ H(X) + o(1). (2.65)

where H(X) is the entropy of X of dimension ρ given in (2.32). In view of thedefinition of H in (2.5), we see that for mixtures the rate-distortion functioncoincides with the entropy of the quantized version up to o(1) term, i.e., ifm = 1√

D(1 + o(1)), then RX(D)−H(〈X〉m) = o(1).

Remark 1. Information dimension is also related to the fine quantization of randomvariables. See the notion of quantization dimension introduced by Zador [44] anddiscussed in [45, Chapter 11].

2.7 High-SNR asymptotics of mutual information

The high-SNR asymptotics of mutual information with additive noise is governedby the input information dimension. Recall the notation I(X, snr) defined in (2.13)which denotes the mutual information between X and

√snrX + NG, where NG is

standard additive Gaussian noise. Using [26, Theorems 2.7 and 3.1], Theorem 1 andthe equivalent definition of information dimension in (2.7), we can relate the scalinglaw of mutual information under weak noise to the information dimension:

Theorem 9. For any random variable X, Then

lim supsnr→∞

I(X, snr)12

log snr= d(X), (2.66)

lim infsnr→∞

I(X, snr)12

log snr= d(X). (2.67)

6We use the following asymptotic notations: f(x) = O (g(x)) if lim sup |f(x)||g(x)| < ∞, f(x) =

Ω(g(x)) if g(x) = O (f(x)), f(x) = Θ(g(x)) if f(x) = O (g(x)) and f(x) = Ω(g(x)), f(x) = o(g(x))

if lim |f(x)||g(x)| = 0, f(x) = ω(g(x)) if g(x) = o(f(x)).

18

Page 30: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

The original results in [26] are proved under the assumption of (2.12), which canin fact be dropped. Also, the authors of [26] did not point out that the “classicalanalogue” of the free entropy dimension they defined in fact coincides with Renyi’sinformation dimension. For completeness, we give a proof in Appendix A.5, whichincludes a non-asymptotic version of Theorem 9.

Theorem 9 also holds for non-Gaussian noise N with finite non-Gaussianness:

D(N) , D(N ||ΦN) <∞, (2.68)

where ΦN is Gaussian distributed with the same mean and variance as N . This isbecause

I(X;√snrX + ΦN) ≤ I(X;

√snrX +N) ≤ I(X;

√snrX + ΦN) +D(N). (2.69)

Moreover, Theorem 9 can be readily extended to random vectors.

Remark 2. Consider the MSE distortion criterion. As a consequence of Theorems9 and 8, when d(X) exists, we have

I(X, snr) =d(X)

2log snr + o(log snr), (2.70)

RX(D) =d(X)

2log

1

D+ o

(log

1

D

), (2.71)

when snr → ∞ and D → 0 respectively. Therefore in the minimization problem(2.53), choosing the forward channel PY |X according to Y = X+

√DNG is first-order

optimal. However, it is unclear whether such a choice is optimal in the followingstronger sense:

limD↓0

RX(D)− I(X;X +√DNG) = 0. (2.72)

2.8 Self-similar distributions

2.8.1 Definitions

An iterated function system (IFS) is a family of contractions F1, . . . , Fm on Rn,where 2 ≤ m < ∞, and Fj : Rn → Rn satisfies ‖Fj(x)− Fj(y)‖2 ≤ rj ‖x− y‖2

for all x, y ∈ Rn with 0 < rj < 1. By [46, Theorem 2.6], given an IFS there isa unique nonempty compact set E, called the invariant set of the IFS, such thatE =

⋃mj=1 Fj(E). The corresponding invariant set is called self-similar, if the IFS

consists of similarity transformations, that is,

Fj(x) = rjOjx+ wj (2.73)

19

Page 31: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

with Oj an orthogonal matrix, rj ∈ (0, 1) and wj ∈ Rn, in which case

‖Fj(x)− Fj(y)‖2

‖x− y‖2

= rj (2.74)

is called the similarity ratio of Fj. If rj is constant for all j, we say that E ishomogeneous [47] (or equicontractive [48]). Self-similar sets are usually fractal [18].For example, consider the IFS on R with

F1(x) =x

3, F2(x) =

x+ 2

3. (2.75)

The resulting invariant set is the middle-third Cantor set.Now we define measures supported on a self-similar set E associated with the IFS

F1, . . . , Fm.

Definition 3. Let P = p1, . . . , pm be a probability vector and F1, . . . , Fm anIFS. A Borel probability measure µ is called self-similar if

µ(A) =m∑j=1

pjµ(F−1j (A)), (2.76)

for all Borel subsets A.

Classical results from fractal geometry ([49], [46, Theorem 2.8]) assert the exis-tence and uniqueness of µ, which is supported on the self-similar invariance set Ethat satisfies µ(Fj(E)) = pj for each j. The usual Cantor distribution [38] can bedefined through the IFS in (2.75) and P = 1/2, 1/2.

Remark 3. The fixed-point equation in (2.76) can be equivalently understood asfollows. Consider a random transformation on R defined by mapping x to Fj(x) withprobability pj. Then µ is the (unique) fixed point of this random transformation, i.e.,the input distribution µ induces the same output distribution. For one-dimensionalself-similar measure with IFS (2.73), the corresponding random transformation isgiven by

Y = RX +W (2.77)

where X is independent of (R,W ) and R is correlated with W according toP R = rj,W = wj = pj, j = 1, . . . ,m. It is easy to check that the distribution of

X =∞∑k=1

Wk

k−1∏i=1

Ri (2.78)

is the fixed-point of (2.77), where (Rk,Wk) in (2.77) is an i.i.d. sequence. In thehomogeneous case, (2.77) becomes

Y = rX +W (2.79)

20

Page 32: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

and (2.78) simplifies to a randomly modulated geometric series7

X =∑k≥1

Wkrk−1, (2.80)

where Wk are i.i.d.From (2.80) we see that distributions defined by M -ary expansion with indepen-

dent digits can be equivalently defined via IFS, in which case r = 1M

and Wk is0, . . . ,M − 1-valued. For example, the ternary expansion of the Cantor distribu-tion consists of independent digits which take value 0 or 2 equiprobably.

2.8.2 Information dimension

It is shown in [23] that8 the information dimension (or dimension of order α) ofself-similar measures always exists.

Theorem 10. Let X be distributed according to µ in (2.76). Then the sequenceH([X]m)− 10 is superadditive. Consequently, d(X) exists and satisfies

d(X) = supm≥1

H([X]m)

m. (2.81)

Proof. See [23, Theorem 1.1, Lemma 2.2 and Equation (3.10)].

In spite of the general existence result in [23], there is no known formula forthe information dimension unless the IFS satisfies certain separation conditions, forexample [18, p. 129]:

• strong separation condition: Fj(E) : j = 1, . . . ,m are disjoint.• open set condition: there exists a nonempty bounded open set U ⊂ Rn, such

that⋃j Fj(U) ⊂ U and Fi(U) ∩ Fj(U) = ∅ for i 6= j.

Note that the open set condition is weaker than the strong separation condition. Thefollowing lemma (proved in Appendix A.6) gives a sufficient condition for a self-similarhomogeneous measure (or random variables of the form (2.80)) to satisfy the openset condition.

Lemma 5. Let the IFS F1, . . . , Fm consist of similarity transformations Fj(x) =rx + wj, where 0 < r < 1 and W , w1, . . . , wm ⊂ R. Then the open set conditionis satisfied if

r ≤ m(W)

m(W) + M(W), (2.82)

where the minimum distance and spread of W are defined by

m(W) , minwi 6=wj∈W

|wi − wj|. (2.83)

7When Wk is binary, (2.80) is known as the Bernoulli convolution, an important object in fractalgeometry and ergodic theory [50].

8The results in [23] are more general, encompassing nonlinear IFS as well.

21

Page 33: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

andM(W) , max

wi,wj∈W|wi − wj| (2.84)

respectively.

The next result gives a formula for the information dimension of a self-similarmeasure µ with IFS satisfying the open set condition:

Theorem 11 ([51], [18, Chapter 17].). Let µ be the self-similar measure (2.76) gen-erated by the IFS (2.73) which satisfies the open set condition. Then

d(µ) =H(P )∑m

j=1 pj log 1rj

(2.85)

and

dα(µ) =β(α)

α− 1, α 6= 1, (2.86)

where β(α) is the unique solution to the following equation of β:

m∑j=1

pαj rβj = 1 (2.87)

In particular, for self-similar homogeneous µ with rj = r for all j,

dα(µ) =Hα(P )

log 1r

, α ≥ 0. (2.88)

Theorem 11 can be generalized to

• non-linear IFS, with similarity ratio rj in (2.85) replaced by the Lipschitz con-stant of Fj [52];

• random variables of the form (2.78) where the sequence (Wk) is stationaryand ergodic, with H(P ) in (2.85) replaced by the entropy rate [25, Theorem4.1].

• IFS with overlaps [53].

The open set condition implies that∑m

j=1 Leb(Fj(U)) = Leb(U)∑m

j=1 rnj ≤

Leb(U). Since 0 < Leb(U) <∞, it follows that

m∑j=1

rnj ≤ 1. (2.89)

In view of (2.85), we have

d(X) =nH(P )

H(P ) +D(P ||R), (2.90)

22

Page 34: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

where R = (rn1 , . . . , rnm) is a sub-probability measure. Since D(P ||R) ≥ 0, we have

0 ≤ d(X) ≤ n, which agrees with (2.11). Note that as long as the open set condition issatisfied, the information dimension does not depend on the translates w1, . . . , wm.

To conclude this subsection, we note that in the homogeneous case, it can be shownthat (2.88) always holds with inequality without imposing any separation condition.

Lemma 6. For self-similar homogeneous measure µ with similarity ratio r,

dα(µ) ≤ Hα(P )

log 1r

, α ≥ 0. (2.91)

Proof. See Appendix A.7.

2.9 Information dimension under projection

Let A ∈ Rm×n with m ≤ n. Then for any Xn,

d(AXn) ≤ mind(Xn), rank(A). (2.92)

Understanding how the dimension of a measure behaves under projections is a basicproblem in fractal geometry. It is well-known that almost every projection preservesthe dimension, be it Hausdorff dimension (Marstrand’s projection theorem [35, Chap-ter 9]) or information dimension [33, Theorems 1.1 and 4.1]. However, computing thedimension for individual projections is in general difficult.

The preservation of information dimension of order 1 < α ≤ 2 under typicalprojections is shown in [33, Theorem 1], which is relevant to our investigation ofthe almost sure behavior of degrees of freedom. The proof of the following result ispotential-theoretic.

Lemma 7. Let α ∈ (1, 2] and m ≤ n. Then for almost every A ∈ Rm×n,

dα(AXn) = min dα(Xn),m , (2.93)

It is easy to see that (2.93) fails for information dimension (α = 1) [33, p. 1041]:let PXn = (1 − ρ)δ0 + ρQ, where Q is an absolutely continuous distribution on Rn

and ρ < mn

. Then d(Xn) = ρn and min dα(Xn),m = m. Note that AXn also hasa mixed distribution with an atom at zero of mass at least 1 − ρ. Then d(AXn) =ρm < m. Examples of nonpreservation of dα for 0 ≤ α < 1 or α > 2 are given in [33,Section 5].

A problem closely related to the degrees of freedom of interference channel is todetermine the dimension difference of a two-dimensional product measure under twoprojections [54]. To ease the exposition, we assume that all information dimension in

23

Page 35: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

the sequel exist. Let p, q, p′, q′ be non-zero real numbers. Then

d(pX + qY )− d(p′X + q′Y )

≤ 1

2(d(pX + qY ) + d(pX + qY )− d(X)− d(Y )) (2.94)

≤ 1

2. (2.95)

where (2.94) and (2.95) follow from (2.20) and (2.21) respectively. Therefore thedimension of a two-dimensional product measure under two projections can differ byat most one half. However, the next result shows that if the coefficients are rational,then the dimension difference is strictly less than one half. The proof is based onnew entropy inequalities for sums of dilated independent random variables derived inAppendix A.8.

Theorem 12. Let p, p′, q, q′ be non-zero integers. Then

d(pX + qY )− d(p′X + q′Y ) ≤ 1

2− ε(p′q, pq′), (2.96)

where

ε(a, b) ,1

36(blog |a|c+ blog |b|c) + 74. (2.97)

Proof. Appendix A.8.

If p′ = q′ = 1, for the special case of (p, q) = (1, 2) and (1,−1), the right-handside of (2.96) can be improved to 3

7and 2

5respectively. See Remark 32 at the end of

Appendix A.8.For irrational coefficients, (2.95) is tight in the following sense:

Theorem 13. Let p′qpq′

be irrational. Then

supX⊥⊥Y

d(pX + qY )− d(p′X + q′Y ) =1

2. (2.98)

Proof. Appendix A.9.

24

Page 36: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Chapter 3

Minkowski dimension

In this chapter we introduce the notion of Minkowski dimension, a central conceptin fractal geometry which gauges the degree of fractality of sets and measures. Therelevance of Minkowski dimension to compressed sensing is mainly due to the linearembeddability of sets of low Minkowski dimension into low-dimensional Euclideanspaces. Section 3.1 gives the definition of Minkowski dimension and overviews itsproperties. Section 3.2 contains definitions and coding theorems of lossless Minkowskidimension compression, which are important intermediate results for Chapter 5. Newtype of concentration-of-measure type of results are proved for memoryless sources,which shows that overwhelmingly large probability is concentrated on subsets of low(Minkowski) dimension. The material in this chapter has been presented in part in[5].

3.1 Minkowski dimension of sets and measures

Definition 4 (Covering number). Let A be a nonempty bounded subset of the metricspace (X, d). For ε > 0, define NA(ε), the ε-covering number of A, to be the smallestnumber of ε-balls needed to cover A, that is,

NA(ε) = min

k : A ⊂

k⋃i=1

B(xi, ε), xi ∈ X

. (3.1)

Definition 5 (Minkowski dimensions). Let A be a nonempty bounded subset ofmetric space (X, d). Define the lower and upper Minkowski dimensions of A as

dimBA = lim infε→0

logNA(ε)

log 1ε

(3.2)

and

dimBA = lim supε→0

logNA(ε)

log 1ε

(3.3)

25

Page 37: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

respectively. If dimBA = dimBA, the common value is called the Minkowski dimensionof A, denoted by dimBA.

The (ε-)Minkowski dimension of a probability measure is defined as the lowestMinkowski dimension among all sets with measure at least 1− ε [30].

Definition 6. Let µ be a probability measure on (Rn,BRn). Define the (ε-)Minkowskidimension of µ as

dimε

B(µ) = infdimB(A) : µ(A) ≥ 1− ε. (3.4)

It should be pointed out that the Minkowski dimension depends on the underlyingmetric. Nevertheless, equivalent metrics result in the same dimension. A few examplesare:

• dimBA = 0 for any finite set A.• dimBA = n for any bounded set A of nonempty interior in Euclidean space Rn.• Let C be the middle-third Cantor set in the unit interval. Then dimBC = log3 2

[18, Example 3.3].• dimB

1i

: i ∈ N

= 12> 0 [18, Example 3.5]. From this example wee see that

Minkowski dimension lacks certain stability properties one would expect of adimension, since it is often desirable that adding a countable set would have noeffect on the dimension. This property fails for Minkowski dimension. On thecontrary, we observe that Renyi information dimension exhibits stability withrespect to adding a discrete component as long as the entropy is finite. However,mixing any distribution with a discrete measure with unbounded support andinfinite entropy will necessarily result in infinite information dimension.

The upper Minkowski dimension satisfies the following properties (see [18, p. 48(iii) and p. 102 (7.9)]), which will be used in the proof of Theorem 14.

Lemma 8. For bounded sets A1, . . . , Ak,

dimB

k⋃i=1

Ai = maxi

dimBAi, (3.5)

dimBA1 × · · · × Ak ≤∑i

dimBAi. (3.6)

The following lemma shows that in Euclidean spaces, without loss of generalitywe can restrict attention to covering A with mesh cubes : For zn ∈ Zn,

Cm(zn) ,

xn ∈ Rn : [xn]m =

zn

2m

(3.7)

=n∏i=1

[zi2m,zi + 1

2m

). (3.8)

26

Page 38: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Since all the mesh cubes partition the whole space, to calculate the lower or upperMinkowski dimension of a set, it suffices to count the number of mesh cubes it inter-sects, hence justifying the name of box-counting dimension. Denote by NA(2−m) thesmallest number of mesh cubes of size 2−m that covers A, that is,

NA(2−m) = min

k : A ⊂

k⋃i=1

Cm(zi), zi ∈ Zn

(3.9)

= |z ∈ Zn : A ∩ Cm(z) 6= ∅| (3.10)

= |[A]m|, (3.11)

where [A]m , [x]m : x ∈ A.

Lemma 9. Let A be a bounded subset in (Rn, ‖·‖p), 1 ≤ p ≤ ∞. The Minkowskidimensions satisfy

dimBA = lim infm→∞

log |[A]m|m

(3.12)

and

dimBA = lim supm→∞

log |[A]m|m

. (3.13)

Proof. See Appendix C.2.

3.2 Lossless Minkowski-dimension compression

As a counterpart to lossless data compression, in this section we investigate theproblem of lossless Minkowski dimension compression for general sources, where theminimum ε-achievable rate is defined as RB(ε). This is an important intermediate toolfor studying the fundamental limits of lossless linear encoding and Lipschitz decodingin Chapter 5. We present bounds for RB(ε) for general sources, as well as tight resultsfor discrete-continuous mixtures and self-similar sources.

Consider a source Xn in Rn equipped with an `p-norm. We define the minimumε-achievable rate for Minkowski-dimension compression as follows:

Definition 7 (Minkowski-dimension compression rate). Let Xi : i ∈ N be astochastic process on (RN,B⊗N). Define

RB(ε) = lim supn→∞

dimε

B(PXn)

n(3.14)

= lim supn→∞

1

ninf

dimBS : S ⊂ Rn,P Xn ∈ S ≥ 1− ε. (3.15)

Remark 4 (Connections to data compression). Definition 7 deals with the minimumupper Minkowski dimension of high-probability events of source realizations. This can

27

Page 39: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

be seen as a counterpart of lossless source coding, which seeks the smallest cardinalityof high-probability events. Indeed, when the source alphabet is some countable setX , the conventional minimum source coding rate is defined like in (3.15) replacingdimBS by log |S| (see Definition 12 in Section 5.2.1). For memoryless sources, thestrong source coding theorem states that

H(X) = limn→∞

1

ninf

log |S| : S ⊂ X n,P Xn ∈ S ≥ 1− ε, 0 < ε < 1. (3.16)

In general, 0 ≤ RB(ε) ≤ 1 for any 0 < ε ≤ 1. This is because for any n, thereexists a compact subset S ⊂ Rn, such that P Xn ∈ S ≥ 1 − ε, and dimBS ≤ n bydefinition. Several coding theorems for RB(ε) are given as follows.

Theorem 14. Suppose that the source is memoryless with distribution such thatd(X) <∞. Then

RB(ε) ≤ limα↑1

dα(X) (3.17)

andRB(ε) ≥ d(X) (3.18)

for 0 < ε < 1.

Proof. See Section 3.3.

For the special cases of discrete-continuous mixed and self-similar sources, we havethe following tight results.

Theorem 15. Suppose that the source is memoryless with a discrete-continuous mixeddistribution. Then

RB(ε) = ρ (3.19)

for all 0 < ε < 1, where ρ is the weight of the continuous part of the distribution. Ifd(X) is finite, then

RB(ε) = d(X). (3.20)

Theorem 16. Suppose that the source is memoryless with a self-similar distributionthat satisfies the strong separation condition. Then

RB(ε) = d(X) (3.21)

for all 0 < ε < 1.

Theorem 17. Suppose that the source is memoryless and bounded, whose M-aryexpansion defined in (2.47) consists of independent (not necessarily identically dis-tributed) digits. Then

RB(ε) = d(X) (3.22)

for all 0 < ε < 1.

28

Page 40: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

When d(X) = ∞, RB(ε) can take any value in [0, 1] for all 0 < ε < 1. Such asource can be constructed using Theorem 15 as follows: Let the distribution of Xbe a mixture of a continuous and a discrete distribution with weights d and 1 − drespectively, where the discrete part is supported on N and has infinite entropy. Thend(X) =∞ by Theorem 1 and Theorem 4, but RB(ε) = d by Theorem 15.

3.3 Proofs

Before showing the converse part of Theorem 14, we state two lemmas which areof independent interest in conventional lossless source coding theory.

Lemma 10. Assume that Xi : i ∈ N is a discrete memoryless source with commondistribution P on the alphabet X . H(P ) < ∞. Let δ ≥ −H(P ). Denote by p∗n(P, δ)the block error probability of the optimal (n, b(H(X) + δ)nc)-code, i.e.,

p∗n(P, δ) , 1− sup P n(S) : |S| ≤ exp(b(H(X) + δ)nc) . (3.23)

Then for any n ∈ N,

p∗n(P, δ) ≤ exp[−nE0(P, δ)], (3.24)

p∗n(P, δ) ≥ 1− exp[−nE1(P, δ)], . (3.25)

where the exponents are given by

E0(P, δ) = infQ:H(Q)>H(P )+δ

D(Q||P ) (3.26)

= maxλ≥0

λ[H(P ) + δ −H 1

1+λ(P )], (3.27)

E1(P, δ) = infQ:H(Q)<H(P )+δ

D(Q||P ) (3.28)

= maxλ≥0

λ[H 1

1−λ(P )−H(P )− δ

]. (3.29)

Lemma 11. For δ ≥ 0,

min E0(P, δ), E1(P,−δ)

≥ min

1

8,1

2

[(δ − e−1 log e)+

log |X |

]2

log e. (3.30)

Proof. See Appendix C.3.

Lemma 10 shows that the error exponents for lossless source coding are not onlyasymptotically tight, but also apply to every block length. This has been shown forrates above entropy in [55, Exercise 1.2.7, pp. 41 – 42] via a combinatorial method.

29

Page 41: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

A unified proof can be given through the method of Renyi entropy, which we omitfor conciseness. The idea of using Renyi entropy to study lossless source coding errorexponents was previously introduced by [56, 57, 58].

Lemma 11 deals with universal lower bounds on the source coding error exponents,in the sense that these bounds are independent of the source distribution. A betterbound on E0(P, δ) has been shown in [59]: for δ > 0

E0(P, δ) ≥ 1

2

log |X |

)2

log e. (3.31)

However, the proof of (3.31) was based on the dual expression (3.27) and a similarlower bound of random channel coding error exponent due to Gallager [60, Exercise5.23], which cannot be applied to E1(P, δ). Here we give a common lower bound onboth exponents, which is a consequence of Pinsker’s inequality [61] combined withthe lower bound on entropy difference by variational distance [55].

Proof of Theorem 14. (Converse) Let 0 < ε < 1 and abbreviate d(X) as d. SupposeRB(ε) < d − 4δ for some δ > 0. Then for sufficiently large n there exists Sn ⊂ Rn,such that P Xn ∈ Sn ≥ 1− ε and dimBS

n ≤ (d− 3δ)n.First we assume that the source has bounded support, that is, |X| ≤ K a.s. for

some K > 0. By Lemma 1, choose M such that for all m > M ,

H([X]m) ≥ (d− δ)m. (3.32)

By Lemma 9, for all n, there exists Mn > M , such that for all m > Mn,

|[Sn]m| ≤ 2m(dimB(Sn)+δn) ≤ 2mn(d−2δ), (3.33)

P [Xn]m ∈ [Sn]m ≥ P Xn ∈ Sn ≥ 1− ε. (3.34)

Then by (3.32), we have|[Sn]m| ≤ 2(H([X]m)−mδ)n. (3.35)

Note that [Xn]m is a memoryless source with alphabet size at most logK + m. ByLemmas 10 and Lemma 11, for all m,n, for all Anm such that |Anm| ≤ 2(H([X]m)−mδ)n,we have

P [Xn]m ∈ Anm ≤ 2−nE1(mδ) (3.36)

≤ 2−n

2

((mδ−e−1 log e)+

m+logK

)2

. (3.37)

Choose m and n so large that the right hand side of (3.37) is less than 1− ε. In thespecial case of Anm = [Sn]m, (3.37) contradicts (3.34) in view of (3.35).

Next we drop the restriction that X is almost surely bounded. Denote by µ thedistribution of X. Let K be so large that

p = µ(B(0, K)) > 1− δ. (3.38)

30

Page 42: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Denote the normalized restriction of µ on B(0, K) by µ1, that is,

µ1(A) =µ(A ∩B(0, K))

µ(B(0, K)), (3.39)

and the normalized restriction of µ on B(0, K)c by µ2. Then

µ = pµ1 + (1− p)µ2. (3.40)

By Theorem 5,

d ≤ pd(µ1) + (1− p)d(µ2) (3.41)

≤ pd(µ1) + (1− p), (3.42)

where d(µ2) ≤ 1 because of the following: since d is finite, we have E [log(1 + |X|)] <∞ in view of Theorem 1. Consequently, E

[log(1 + |X|)1|X|>K

]<∞, hence d(µ2) ≤

1.The distribution of X , X1X∈B(0,K) is given by

µ = pµ1 + (1− p)δ0, (3.43)

where δ0 denotes the Dirac measure with atom at 0. Then

d(X) = pd(µ1) (3.44)

> d− δ. (3.45)

where

• (3.44): by (3.43) and (2.35), since Theorem 4 implies d(δ0) = 0.• (3.45): by (3.42) and (3.38).

For T ⊂ 1, . . . , n, let SnT = xnT : xn ∈ Sn. Define

Sn =⋃

T⊂1,...,nxn ∈ Rn : xnT ∈ SnT , xnT c = 0. (3.46)

Then for all n,

PXn ∈ Sn

≥ P Xn ∈ Sn ≥ 1− ε, (3.47)

31

Page 43: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

and

dimBSn

= maxT⊂1,...,n

dimBxn ∈ Rn : xnT ∈ SnT , xnT c = 0 (3.48)

≤ dimBSn (3.49)

≤ (d− 3δ)n (3.50)

< (d(X)− 2δ)n, (3.51)

where (3.51), (3.48) and (3.49) follow from (3.45), (3.5) and (3.6) respectively. Butnow (3.51) and (3.47) contradict the converse part of Theorem 14 for the boundedsource X, which we have already proved.

(Achievability) Recall thatd = lim

α↑1dα. (3.52)

We show that for all 0 < ε < 1, for all δ > 0, there exists N such that for all n > N ,there exists Sn ⊂ Rn with P Xn ∈ Sn ≥ 1 − ε and dimBS

n ≤ n(d + δ). ThereforeRB(ε) ≤ d readily follows.

Suppose for now that we can construct a sequence of subsets V nm ⊂ 2−mZn, such

that for any n the following holds:

|V nm| ≤ 2mn(d+δ), (3.53)∑

m≥1

P [Xn]m /∈ V nm <∞. (3.54)

Recalling the mesh cube defined in (3.8), we denote

Unm =

∑zn∈V nm

Cm(2mzn), (3.55)

T nm =⋂j≥m

Unj , (3.56)

T n =⋃m≥1

T nm = lim infm→∞

Unm. (3.57)

Then [Unm]m = V n

m. Now we claim that for each m, dimBTnm ≤ (d + δ)n and

P Xn ∈ T n = 1. First observe that for each j ≥ m, T nm ⊂ Unj , therefore Cj(2jzn) :

zn ∈ V nj covers T nm. Hence |[T nm]j| ≤ |V n

j | ≤ 2jn(d+δ), by (3.53). Therefore byLemma 9,

dimBTnm = lim sup

j→∞

log |[T nm]j|j

≤ n(d+ δ). (3.58)

32

Page 44: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

By the Borel-Cantelli lemma, (3.54) implies that P Xn ∈ T n = 1. Let Sn =⋃Mm=1 T

nm where M is so large that P Xn ∈ Sn ≥ 1−ε. By the finite subadditivity1 of

upper Minkowski dimension in Lemma 8, dimBSn = maxm=1,...,M dimBT

nm ≤ n(d+ δ).

By the arbitrariness of δ, the ε-achievability of rate d is proved.Now let us proceed to the construction of the required V n

m. To that end, denoteby µm the probability mass function of [X]m. Let

V nm =

zn ∈ 2−mZn :

1

mnlog

1

µnm(zn)≤ d+ δ

(3.59)

=zn ∈ 2−mZn : µnm(zn) ≥ 2−mn(d+δ)

(3.60)

Then, immediately (3.53) follows from P [Xn]m ∈ V nm ≤ 1. Also, for α < 1,

P [Xn]m /∈ V nm

=∑

zn∈2−mZnµnm(zn)1zn /∈V nm (3.61)

≤∑

zn∈2−mZnµnm(zn)

(2−mn(d+δ)

µnm(zn)

)1−α

(3.62)

= 2−nm(1−α)(d+δ)∑

zn∈2−mZn[µnm(zn)]α (3.63)

= 2(1−α)(Hα([Xn]m)−mn(d+δ)) (3.64)

= 2n(1−α)(Hα([X]m)−md−mδ). (3.65)

where (3.65) follows from the fact that [Xn]m = [[X1]m, . . . , [Xn]m]T are i.i.d. and thejoint Renyi entropy is the sum of individual Renyi entropies. Hence

logP [Xn]m /∈ V nm

m≤ (1− α)

(Hα([X]m)

m− d− δ

)n. (3.66)

Choose 0 < α < 1 such that dα < d + δ/2, which is guaranteed to exist in view of(3.52) and the fact that dα is nonincreasing in α according to Lemma 4. Then

lim supm→∞

logP [Xn]m /∈ V nm

m

≤ (1− α)

(lim supm→∞

Hα([X]m)

m− d− δ

)n (3.67)

= (1− α)(dα − d− δ

)n (3.68)

≤ − (1− α)nδ/2 < 0, (3.69)

1Countable subadditivity fails for upper Minkowski dimension. Had it been satisfied, we couldhave picked Sn = Tn to achieve ε = 0.

33

Page 45: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Hence P [Xn]m /∈ V nm decays at least exponentially with m. Accordingly, (3.54)

holds, and the proof of RB(ε) ≤ d is complete.

Next we prove RB(ε) = d(X) for the special case of discrete-continuous mixtures.The following lemma is needed in the converse proof.

Lemma 12 ([62, Theorem 4.16]). Any Borel set A ⊂ Rk whose upper Minkowskidimension is strictly less than k has zero Lebesgue measure.

Proof of Theorem 15. (Achievability) Let the distribution of X be

µ = (1− ρ)µd + ρµc, (3.70)

where 0 ≤ ρ ≤ 1, µc is a probability measure on (R,B) absolutely continuous withrespect to Lebesgue measure and µd is a discrete probability measure. Let A be thecollection of all the atoms of µd, which is, by definition, a countable subset of R.

Let Wi = 1Xi /∈A. Then Wi is a sequence of i.i.d. binary random variables withexpectation

EWi = P Xi /∈ A = (1− ρ)µd(Ac) + ρµc(Ac) = ρ. (3.71)

By the weak law of large numbers (WLLN),

1

n|spt(Xn)| = 1

n

n∑i=1

WiP−→ρ. (3.72)

where the generalized support of vector xn is defined as

spt(xn) , i = 1, . . . , n : xi /∈ A. (3.73)

Fix an arbitrary δ > 0 and let

Cn = xn ∈ Rn : |spt(xn)| < (ρ+ δ)n . (3.74)

By (3.72), for sufficiently large n, P Xn ∈ Cn ≥ 1− ε/2. Decompose Cn as:

Cn =⋃

T⊂1,...,n|T |<(ρ+δ)n

⋃z∈An−|T |

UT,z, (3.75)

whereUT,z = xn ∈ Rn : spt(xn) = T, xnT c = z . (3.76)

Note that the collection of all UT,z is countable, and thus we may relabel them asUj : j ∈ N. Then Cn =

⋃j Uj. Hence there exists J ∈ N and R > 0, such that

P Xn ∈ Sn ≥ 1− ε, where

Sn = B(0, R) ∩J⋃j=1

Uj. (3.77)

34

Page 46: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Then

dimBSn = maxj=1,...,J

dimBUj ∩B(0, R) (3.78)

≤ (ρ+ δ)n, (3.79)

where (3.78) is by Lemma 8, and (3.79) follows from dimBUj∩B(0, R) ≤ |T | < (ρ+δ)nfor each j. This proves the ε-achievability of ρ+δ. By the arbitrariness of δ, RB(ε) ≤ ρ.

(Converse) Since RB(ε) ≥ 0, we can assume ρ > 0. Let Sn ∈ Rn be such thatP Xn ∈ Sn ≥ 1− ε. Define

Tn = xn ∈ Rn : |spt(xn)| > (ρ− δ)n . (3.80)

By (3.72), P Xn ∈ Tn ≥ 1 − ε for sufficiently large n. Let Gn = Sn ∩ Tn, thenP Xn ∈ Gn ≥ 1− 2ε. Write Gn as the disjoint union:

Gn =⋃

T⊂1,...,n|T |>(ρ−δ)n

⋃z∈An−|T |

VT,z, (3.81)

whereVT,z = UT,z ∩Gn. (3.82)

Also let ET,z = xnT : xn ∈ VT,z. Since P Xn ∈ Gn ≥ 1 − 2ε > 0, there existsT ⊂ 1, . . . , n and z ∈ An−|T |, such that P Xn ∈ VT,z > 0. Note that

P Xn ∈ VT,z

= ρ|T |(1− ρ)n−|T |µ|T |c (ET,z)

n−|T |∏i=1

µd(zi) > 0. (3.83)

If ρ < 1, µ|T |c (ET,z) > 0, which implies ET,z ⊂ R|T | has positive Lebesgue measure.

By Lemma 12, dimBET,z = |T | > (ρ− δ)n, hence

dimBSn ≥ dimBVT,z = dimBET,z > (ρ− δ)n. (3.84)

If ρ = 1, µ = µc implies that Sn ⊂ Rn has positive Lebesgue measure. Thus dimBSn =n. By the arbitrariness of δ > 0, the proof of RB(ε) ≥ ρ is complete.

Next we prove RB(ε) = d(X) for self-similar sources under the same assumptionof Theorem 11. This result is due to the stationarity and ergodicity of the underlyingdiscrete process that generates the analog source distribution.

Proof of Theorem 16. By Theorem 11, d(X) is finite. Therefore the converse followsfrom Theorem 14. To show achievability, we invoke the following definition. Definethe local dimension of a Borel measure ν on Rn as the function (if the limit exists)[46, p. 169]

dimlocν(x) = limr↓0

log ν(B(x, r))

log r(3.85)

35

Page 47: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Denote the distribution of Xn by the product measure µn, which is also self-similarand satisfies the strong separation theorem. By [46, Lemma 6.4(b) and Proposition10.6],

dimlocµn(x) = d(µn) = nd(X) (3.86)

holds for µn-almost every x. Define the sequence of random variables

Dm(Xn) =1

mlog

1

µn(B(Xn, 2−m)). (3.87)

Then (3.86) implies that Dma.s.−−→ nd(X) as m→∞. Therefore for all 0 < ε < 1 and

δ > 0, there exists M , such that

P

∞⋂m=M

Dm(Xn) ≤ nd(X) + δ

≥ 1− ε. (3.88)

Let

Sn =∞⋂

m=M

xn : Dm(xn) ≤ nd(X) + δ (3.89)

=∞⋂

m=M

xn : µn(B(xn, 2−m)) ≥ 2−m(nd(X)+δ). (3.90)

Then P Xn ∈ Sn ≥ 1− ε in view of (3.88), and

dimBSn = lim supm→∞

logNSn(2−m)

m(3.91)

≤nd(X) + δ, (3.92)

where (3.92) follows from (3.90) and µn(Sn) ≤ 1. Hence RB(ε) ≤ d(X) is proved.

Finally we prove RB(ε) = d(X) for memoryless sources whose M -ary expansionconsisting of independent digits.

Proof of Theorem 17. Without loss of generality, assume M = 2, that is, the binaryexpansion of X consists of independent bits. We follow the same steps as in theachievability proof of Theorem 14. Suppose for now that we can construct a sequenceof subsets V n

m, such that (3.54) holds and their cardinality does not exceed:

|V nm| ≤ 2n(H([X]m)+mδ) (3.93)

36

Page 48: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

By the same arguments that lead to (3.58), the sets defined in (3.56) satisfy

dimBTnm = lim sup

j→∞

log |[T nm]j|j

(3.94)

= n lim supj→∞

H([X]j) + jδ

j(3.95)

= n(d+ δ). (3.96)

Since limm→∞ P Xn ∈ T nm = 1, this shows the ε-achievability of d.Next we proceed to construct the required V n

m. Applying Lemma 10 to thediscrete memoryless source [X]m and blocklength n yields

p∗n([X]m,mδ) ≤ 2−nE0(mδ). (3.97)

By the assumption that X is bounded, without loss of generality, we shall assume|X| ≤ 1 a.s. Therefore the alphabet size of [X]m is at most 2m. Simply applying(3.31) to [X]m yields the lower bound

E0([X]m,mδ) ≥δ2

2, (3.98)

which does not grow with m and cannot suffice for our purpose of constructing V nm.

Exploiting the structure that [X]m consists of independent bits, we can show a muchbetter bound:

E0([X]m,mδ) ≥mδ2

2. (3.99)

Then by (3.97) and (3.99), there exists V nm, such that (3.93) holds and

P [Xn]m /∈ V nm ≤ 2−

nmδ2

2 , (3.100)

which implies (3.54). This yields the desired RB(ε) ≤ d.To finish the proof, it remains to establish (3.99). By (3.31), for all δ ∈ R,

E0(P, δ) ≥ c(δ)

log2 |X |, (3.101)

where

c(x) =log e

2(x+)2, (3.102)

which is a nonnegative nondecreasing convex function.Since |X| ≤ 1 a.s., [X]m is in one-to-one correspondence with [(X)1, . . . , (X)m].

Denote the distribution of [(X)1, . . . , (X)m] and (X)i by Pm and Pi respectively. Byassumption, (X)1, . . . , (X)m are independent, hence

Pm = P1 × · · · × Pm. (3.103)

37

Page 49: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

By (3.26),E0(Pm,mδ) = min

Qm:H(Qm)≥H([X]m)+mδD(Qm||Pm), (3.104)

where Qm is a distribution on Zm2 . Denote the marginals of Qm by Q1, . . . , Qm.Combining (3.103) with properties of entropy and relative entropy, we have

D(Qm||Pm) ≥m∑i=1

D(Qi||Pi), (3.105)

and

H(Pm) =m∑i=1

H(Pi) (3.106)

H(Qm) ≤m∑i=1

H(Qi) (3.107)

Therefore,

E0(Pm,mδ) = minH(Qm)≥H([X]m)+mδ

D(Qm||Pm) (3.108)

≥ minH(Qm)≥H([X]m)+mδ

m∑i=1

D(Qi||Pi) (3.109)

≥ min∑αi=H([X]m)+mδ

m∑i=1

minH(Qi)≥αi

D(Qi||Pi) (3.110)

≥ min∑αi=H([X]m)+mδ

m∑i=1

E0(Pi, αi −H(Pi)) (3.111)

≥ min∑αi=H([X]m)+mδ

m∑i=1

c(αi −H(Pi)) (3.112)

= min∑βi=mδ

m∑i=1

c(βi) (3.113)

≥ mc(δ) (3.114)

=1

2mδ2 log e, (3.115)

where

• (3.109): by (3.105).• (3.110): by (3.107), we have

Qm : H(Qm) ≥ H([X]m) +mδ

⊂Qm : H(Qi) ≥ αi,

∑αi = H([X]m) +mδ

. (3.116)

38

Page 50: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

• (3.112): by (3.101).• (3.113): let βi = αi −H(Xi). Then

∑βi = mδ, by (3.106).

• (3.114): due to the convexity of x 7→ (x+)2.

39

Page 51: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Chapter 4

MMSE dimension

In this chapter we introduce the MMSE dimension, an information measure thatgoverns the high-SNR asymptotics of MMSE of estimating a random variable cor-rupted by additive noise. It is also the asymptotic ratio of nonlinear MMSE to linearMMSE. Section 4.4 explores the relationship between the MMSE dimension and theinformation dimension. Section 4.5 gives results on (conditional) MMSE dimensionfor various input distributions. When the noise is Gaussian, we show that the MMSEdimension and the information dimension coincide whenever the input distributionhas no singular component. Technical proofs are relegated to Appendix B. The ma-terial in this chapter has been presented in part in [63, 28, 64].

4.1 Basic setup

The minimum mean square error (MMSE) plays a pivotal role in estimation theoryand Bayesian statistics. Due to the lack of closed-form expressions for posterior distri-butions and conditional expectations, exact MMSE formulae are scarce. Asymptoticanalysis is more tractable and sheds important insights about how the fundamentalestimation-theoretic limits depend on the input and noise statistics. The theme ofthis chapter is the high-SNR scaling law of MMSE of estimating a random variablebased on observations corrupted by additive noise.

The MMSE of estimating X based on Y is denoted by1

mmse(X|Y ) = inffE[(X − f(Y ))2

](4.1)

= E[(X − E[X|Y ])2

]= E [var(X|Y )] , (4.2)

1Throughout this chapter, the probability space (Ω,F ,P) is fixed. For p ≥ 1, Lp(Ω) denotes thecollection of all random variables defined on (Ω,F ,P) with finite pth moments. L∞(Ω) denotes thecollection of all almost surely (a.s.) bounded random variables. fX denotes the density of a randomvariable X whose distribution is absolutely continuous with respect to Lebesgue measure. µ ⊥ νdenotes that µ and ν are mutually singular, i.e., there exists a measurable set E such that µ(E) = 1and ν(E) = 0. NG denotes a standard normal random variable. For brevity, in this chapter naturallogarithms are adopted and information units are nats.

40

Page 52: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

where the infimum in (4.1) is over all Borel measurable f . When Y is related to Xthrough an additive-noise channel with gain

√snr, i.e.,

Y =√snrX +N (4.3)

where N is independent of X, we denote

mmse(X,N, snr) , mmse(X|√snrX +N), (4.4)

and, in particular, when the noise is Gaussian, we simplify

mmse(X, snr) , mmse(X,NG, snr). (4.5)

When the estimator has access to some side information U such that N is independentof X,U, we define

mmse(X,N, snr|U) , mmse(X|√snrX +N,U) (4.6)

= E[(X − E[X|√snrX +N,U ])2]. (4.7)

Assuming 0 < varN < ∞, next we proceed to define the (conditional) MMSEdimension formally. We focus particular attention on the case of Gaussian noise.First note the following general inequality:

0 ≤ mmse(X,N, snr|U) ≤ varN

snr. (4.8)

where the rightmost side can be achieved using the affine estimator f(y) = y−EN√snr

.Therefore as snr→∞, it holds that:

mmse(X,N, snr|U) = O

(1

snr

). (4.9)

We are interested in the exact scaling constant, which depends on the distribution ofX,U and N . To this end, we introduce the following notion:

Definition 8. Define the upper and lower MMSE dimension of the pair (X,N) asfollows:

D(X,N) = lim supsnr→∞

snr ·mmse(X,N, snr)

varN, (4.10)

D(X,N) = lim infsnr→∞

snr ·mmse(X,N, snr)

varN. (4.11)

If D(X,N) = D(X,N), the common value is denoted by D(X,N), called the MMSEdimension of (X,N). In particular, when N is Gaussian, we denote these limitsby D(X),D(X) and D(X), called the upper, lower and MMSE dimension of Xrespectively.

41

Page 53: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Replacing mmse(X,N, snr) by mmse(X,N, snr|U), the conditional MMSE dimen-sion of (X,N) given U can be defined similarly, denoted by D(X,N |U),D(X,N |U)and D(X,N |U) respectively. WhenN is Gaussian, we denote them by D(X|U),D(X|U)and D(X|U), called the upper, lower and conditional MMSE dimension of X givenU respectively.

The MMSE dimension of X governs the high-SNR scaling law of mmse(X, snr). IfD(X) exists, we have

mmse(X, snr) =D(X)

snr+ o

(1

snr

). (4.12)

As we show in Section 4.3.2, MMSE dimension also characterizes the high-SNR sub-optimality of linear estimation.

The following proposition is a simple consequence of (4.8):

Theorem 18.0 ≤ D(X,N |U) ≤ D(X,N |U) ≤ 1. (4.13)

The next result shows that MMSE dimension is invariant under translations andpositive scaling of input and noise.

Theorem 19. For any α, β, γ, η ∈ R, if either

• αβ > 0

or

• αβ 6= 0 and either X or N has a symmetric distribution,

then

D(X,N) = D(αX + γ, βN + η), (4.14)

D(X,N) = D(αX + γ, βN + η). (4.15)

Proof. Appendix B.1.

Remark 5. Particularizing Theorem 19 to Gaussian noise, we obtain the scale-invariance of MMSE dimension: D(αX) = D(X) for all α 6= 0, which can be gener-alized to random vectors and all orthogonal transformations. However, it is unclearwhether the invariance of information dimension under bi-Lipschitz mappings estab-lished in Theorem 2 holds for MMSE dimension.

Although most of our focus is on square integrable random variables, the functionalmmse(X,N, snr) can be defined for infinite-variance noise. Consider X ∈ L2(Ω) butN /∈ L2(Ω). Then mmse(X,N, snr) is still finite but (4.10) and (4.11) cease to makesense. Hence the scaling law in (4.9) could fail. It is instructive to consider thefollowing example:

42

Page 54: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Example 1. Let X be uniformly distributed in [0, 1] and N have the following den-sity:

fNα(z) =α− 1

zα1z>1, 1 < α ≤ 3. (4.16)

Then EN2α = ∞. As α decreases, the tail of Nα becomes heavier and accordingly

mmse(X,Nα, snr) decays slower. For instance, for α = 2 and α = 3 we obtain (seeAppendix B.2)

mmse(X,N2, snr) =3 + π2

18

1√snr− log2 snr

4snr+ Θ

(log snr

snr

), (4.17)

mmse(X,N3, snr) =log snr

snr− 2(2 + log 2)

snr+ Θ

(1

snr3/2

)(4.18)

respectively. Therefore in both cases the MMSE decays strictly slower than 1snr

, i.e.,for N = N2 or N3,

mmse(X,N, snr) = ω

(1

snr

). (4.19)

However, varN = ∞ does not alway imply (4.19). For example, consider anarbitrary integer-valued N (none of whose moments may exist) and 0 < X < 1 a.s.Then mmse(X,N, snr) = 0 for all snr > 1.

4.2 Related work

The low-SNR asymptotics of mmse(X, snr) has been studied extensively in [65, 66].In particular, it is shown in [66, Proposition 7] that if all moments of X are finite,then mmse(X, ·) is smooth on R+ and admits a Taylor expansion at snr = 0 up toarbitrarily high order. For example, if E [X] = 0 and E [X2] = 1, then as snr→ 0,

mmse(X, snr) = 1− snr + (2− (EX3)2)snr2

2(4.20)

−(15− 12(EX3)2 − 6EX4 + (EX4)2

)snr36

+O(snr4).

However, the asymptotics in the high-SNR regime remain underexplored in the lit-erature. In [65, p. 1268] it is pointed out that the high-SNR behavior depends onthe input distribution: For example, for binary X, mmse(X, snr) decays exponen-tially, while for standard Gaussian X, mmse(X, snr) = 1

snr+1. Unlike the low-SNR

regime, the high-SNR behavior is considerably more complicated, as it depends onthe measure-theoretical structure of the input distribution rather than moments. Onthe other hand, it can be shown that the high-SNR asymptotics of MMSE is equiva-lent to the low-SNR asymptotics when the input and noise distributions are switched.As we show in Section 4.3.4, this simple observation yields new low-SNR results forGaussian input contaminated by non-Gaussian noise.

43

Page 55: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

The asymptotic behavior of Fisher’s information (closely related to MMSE whenN is Gaussian) is conjectured in [26, p. 755] to satisfy

limsnr→∞

J(√snrX +NG) = 1− d(X). (4.21)

As a corollary to our results, we prove this conjecture when X has no singular com-ponents. The Cantor distribution gives a counterexample to the general conjecture(Section 4.6.1).

Other than our scalar Bayesian setup, the weak-noise asymptotics of optimal esti-mation/filtering error has been studied in various regimes in statistics. One exampleis filtering a deterministic signal observed in weak additive white Gaussian noise(AWGN): Pinsker’s theorem ([67, 68]) establishes the exact asymptotics of the op-timal minimax square error when the signal belongs to a Sobolev class with finiteduration. For AWGN channels and stationary Markov input processes that satisfya stochastic differential equation, it is shown in [69, p. 372] that, under certain

regularity conditions, the filtering MMSE decays as Θ(

1√snr

).

4.3 Various connections

4.3.1 Asymptotic statistics

The high-SNR behavior of mmse(X, snr) is equivalent to the behavior of quadraticBayesian risk for the Gaussian location model in the large sample limit, where PXis the prior distribution and the sample size n plays the role of snr. To see this, letNi : i ∈ N be a sequence of i.i.d. standard Gaussian random variables independentof X and denote Yi = X + Ni and Y n = (Y1, . . . , Yn). By the sufficiency of samplemean Y = 1

n

∑ni=1 Yi in Gaussian location models, we have

mmse(X|Y n) = mmse(X|Y ) = mmse(X,n), (4.22)

where the right-hand side is the function mmse(X, ·) defined in (4.5) evaluated at n.Therefore as the sample size grows, the Bayesian risk of estimating X vanishes asO(

1n

)with the scaling constant given by the MMSE dimension of the prior 2

D(X) = limn→∞

nmmse(X|Y n). (4.23)

The asymptotic expansion of mmse(X|Y n) has been studied in [70, 71, 72] for ab-solutely continuous priors and general models where X and Y are not necessarilyrelated by additive Gaussian noise. Further comparison to our results is given inSection 4.5.3.

2In view of Lemma 13 in Section 4.6, the limit of (4.23) is unchanged even if n takes non-integervalues.

44

Page 56: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

4.3.2 Optimal linear estimation

MMSE dimension characterizes the gain achievable by nonlinear estimation overlinear estimation in the high-SNR regime. To see this, define the linear MMSE(LMMSE) as

lmmse(X|Y ) = mina,b∈R

E[(aY + b−X)2]. (4.24)

When Y =√snrX + N , direct optimization over a and b reveals that the best pa-

rameters are given by

a =

√snr varX

snr varX + varN, (4.25)

b =E[X] varN −

√snrE[N ] varX

snr varX + varN. (4.26)

Hence

lmmse(X,N, snr) , lmmse(X|√snrX +N) (4.27)

=varN varX

snr varX + varN(4.28)

=varN

snr+ o

(1

snr

), (4.29)

as snr→∞. As long as varN <∞, the above analysis holds for any input X even ifvarX = ∞, in which case (4.28) simplifies to lmmse(X|

√snrX + N) = varN

snr. In view

of Definition 8, (4.29) gives a more general definition of MMSE dimension:

D(X,N) = limsnr→∞

mmse(X,N, snr)

lmmse(X,N, snr), (4.30)

which can be easily generalized to random vectors or processes.It is also possible to show that (4.30) carries over to non-square loss functions

that behaves quadratically near zero. To this end, consider the loss function `(x, y) =ρ(x− y) where ρ ∈ C2(R) is convex with ρ′(0) = 0 and ρ′′(0) > 0. The Bayesian riskof estimating X based on

√snrX +N with respect to ρ is

R(X,N, snr) = inffE[ρ(X − f(

√snrX +N))

], (4.31)

while the linear Bayesian risk is

RL(X,N, snr) = infa,b∈R

E[ρ(X − a(

√snrX +N)− b)

]. (4.32)

We conjecture that (4.30) holds with mmse and lmmse replaced by R and RL respec-tively.

45

Page 57: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

4.3.3 Asymptotics of Fisher information

In the special case of Gaussian noise it is interesting to draw conclusions on theasymptotic behavior of Fisher’s information based on our results. Recall that theFisher information (with respect to the location parameter) of a random variable Zis defined as: [73, Definition 4.1]

J(Z) = sup|E [ψ′(Z)] |2 : ψ ∈ C1, E[ψ2(Z)

]= 1 (4.33)

where C1 denotes the collection of all continuously differentiable functions. When Z

has an absolutely continuous density fZ , we have J(Z) =∫ f ′Z

2

fZ. Otherwise, J(Z) =

∞ [73].In view of the representation of MMSE by the Fisher information of the channel

output with additive Gaussian noise [74, (1.3.4)], [65, (58)]:

snr ·mmse(X, snr) = 1− J(√snrX +NG) (4.34)

and J(aZ) = a−2J(Z), letting ε = 1√snr

yields

mmse(X, ε−2) = ε2 − ε4J(X + εNG). (4.35)

By the lower semicontinuity of Fisher information [73, p. 79], when the distributionof X is not absolutely continuous, J(X + εNG) diverges as ε vanishes, but no fasterthan

J(X + εNG) ≤ ε−2, (4.36)

because of (4.35). Similarly to the MMSE dimension, we can define the Fisher di-mension of a random variable X as follows:

J (X) = lim supε↓0

ε2 · J(X + εNG), (4.37)

J (X) = lim infε↓0

ε2 · J(X + εNG). (4.38)

Equation (4.35) shows Fisher dimension and MMSE dimension are complementaryof each other:

J (X) + D(X) = J (X) + D(X) = 1. (4.39)

In [26, p.755] it is conjectured that

J (X) = 1− d(X) (4.40)

or equivalentlyD(X) = d(X). (4.41)

46

Page 58: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

According to Theorem 4, this holds for distributions without singular componentsbut not in general. Counterexamples can be found for singular X. See Section 4.5.5for more details.

4.3.4 Duality to low-SNR asymptotics

Note that

snr ·mmse(X,N, snr) = mmse(√snrX|

√snrX +N) (4.42)

= mmse(N |√snrX +N) (4.43)

= mmse(N,X, snr−1

), (4.44)

which gives an equivalent definition of the MMSE dimension:

D(X,N) = limε↓0

mmse(N,X, ε). (4.45)

This reveals an interesting duality: the high-SNR MMSE scaling constant is equalto the low-SNR limit of MMSE when the roles of input and noise are switched.Restricted to the Gaussian channel, it amounts to studying the asymptotic MMSEof estimating a Gaussian random variable contaminated with strong noise with anarbitrary distribution. On the other end of the spectrum, the asymptotic expansionof the MMSE of an arbitrary random variable contaminated with strong Gaussiannoise is studied in [66, Section V.A]. The asymptotics of other information measureshave also been studied: For example, the asymptotic Fisher information of Gaussian(or other continuous) random variables under weak arbitrary noise was investigatedin [75]. The asymptotics of non-Gaussianness (see (2.68)) in this regime is studied in[76, Theorem 1]. The second-order asymptotics of mutual information under strongGaussian noise is studied in [77, Section IV].

Unlike mmse(X, snr) which is monotonically decreasing with snr, mmse(NG|√snrX+

NG) may be increasing in snr (Gaussian X), decreasing (binary-valued X) or oscil-latory (Cantor X) in the high-SNR regime (see Fig. 4.3). In those cases in whichsnr 7→ mmse(NG|

√snrX + NG) is monotone, MMSE dimension and information

dimension exist and coincide, in view of (4.45) and Theorem 20.

4.4 Relationship between MMSE dimension and

Information Dimension

The following theorem reveals a connection between the MMSE dimension andthe information dimension of the input:

Theorem 20. If H(bXc) <∞, then

D(X) ≤ d(X) ≤ d(X) ≤ D(X). (4.46)

47

Page 59: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Therefore, if D(X) exists, then d(X) exists and

D(X) = d(X), (4.47)

and equivalently, as snr→∞,

mmse(X, snr) =d(X)

snr+ o

(1

snr

). (4.48)

In view of (4.13) and (2.14), (4.46) fails when H(bXc) =∞. A Cantor-distributedX provides an example in which the inequalities in (4.46) are strict (see Section 4.5.5).However, whenever X has a discrete-continuous mixed distribution, (4.47) holds andthe information dimension governs the high-SNR asymptotics of MMSE. It is alsoworth pointing out that (4.46) need not hold when the noise is non-Gaussian. SeeFig. 4.2 for a counterexample when the noise is uniformly distributed on the unitinterval.

The proof of Theorem 20 hinges on two crucial results:

• The I-MMSE relationship [65]:3

d

dsnrI(snr) =

1

2mmse(snr). (4.49)

where we have used the following short-hand notations:

I(snr) = I(X;√snrX +NG), (4.50)

mmse(snr) = mmse(X, snr). (4.51)

• The high-SNR scaling law of I(snr) in Theorem 9.

Before proceeding to the proof, we first outline a naıve attempt at proving thatMMSE dimension and information dimension coincide. Assuming

limsnr→∞

I(snr)

log snr=d(X)

2, (4.52)

it is tempting to apply the l’Hopital’s rule to (4.52) to conclude

limsnr→∞

ddsnr

I(snr)1snr

=d(X)

2, (4.53)

which, combined with (4.49), would produce the desired result in (4.47). However,this approach fails because applying l’Hopital’s rule requires establishing the exis-tence of the limit in (4.53) in the first place. In fact, we show in Section 4.5.5 whenX has certain singular (e.g., Cantor) distribution, the limit in (4.53), i.e. the MMSE

3The previous result in [65, Theorem 1] requires E[X2]< ∞, which, as shown in [64], can be

weakened to H(bXc) <∞.

48

Page 60: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

dimension, does not exist because of oscillation. Nonetheless, because mutual infor-mation is related to MMSE through an integral relation, the information dimensiondoes exist since oscillation in MMSE is smoothed out by the integration.

In fact it is possible to construct a function I(snr) which satisfies all the mono-tonicity and concavity properties of mutual information [65, Corollary 1]:

I > 0, I ′ > 0, I ′′ < 0. (4.54)

I(0) = I ′(∞) = 0, I(∞) =∞; (4.55)

yet the limit in (4.53) does not exist because of oscillation. For instance,

I(snr) =d

2

[log(1 + snr) +

1− cos(log(1 + snr))

2

](4.56)

satisfies (4.52), (4.54) and (4.55), but (4.53) fails, since

I ′(snr) =d [2 + sin(log(1 + snr))]

4(1 + snr). (4.57)

In fact (4.57) gives a correct quantitative depiction of the oscillatory behavior ofmmse(X, snr) for Cantor-distributed X (see Fig. 4.1).

Proof of Theorem 20. First we prove (4.46). By (4.49), we obtain

I(snr) =1

2

∫ snr

0

mmse(γ)dγ. (4.58)

By definition of D(X), for all δ > 0, there exists γ0, such that γmmse(γ) ≥ D(X)−δholds for all γ > γ0. Then by (4.58), for any snr > γ0,

I(snr) ≥ 1

2

∫ γ0

0

mmse(γ)dγ +1

2

∫ snr

γ0

D(X)− δγ

dγ (4.59)

≥ 1

2(D(X)− δ) log snr +O (1). (4.60)

In view of (2.67), dividing both sides by 12

log snr and by the arbitrariness of δ > 0,

we conclude that D(X) ≤ d(X). Similarly, D(X) ≥ d(X) holds.Next, assuming the existence of D(X), (4.47) simply follows from (4.46), which

can also be obtained by applying l’Hopital’s rule to (4.52) and (4.53).

4.5 Evaluation of MMSE dimension

In this section we drop the assumption of H(bXc) <∞ and proceed to give resultsfor various input and noise distributions.

49

Page 61: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

4.5.1 Data processing lemma for MMSE dimension

Theorem 21. For any X,U and any N ∈ L2(Ω),

D(X,N) ≥ D(X,N |U), (4.61)

D(X,N) ≥ D(X,N |U). (4.62)

When N is Gaussian and U is discrete, (4.61) and (4.62) hold with equality.

Proof. Appendix B.3.

In particular, Theorem 21 states that no discrete side information can reduce theMMSE dimension. Consequently, we have

D(X|[X]m) = D(X) (4.63)

for any m ∈ N, that is, knowing arbitrarily finitely many digits X does not reduceits MMSE dimension. This observation agrees with our intuition: when the noise isweak, [X]m can be estimated with exponentially small error (see (4.68)), while thefractional part X − [X]m is the main contributor to the estimation error.

It is possible to extend Theorem 21 to non-Gaussian noise (e.g., uniform, expo-nential or Cauchy distributions) and more general models. See Remark 33 at the endof Appendix B.3.

4.5.2 Discrete input

Theorem 22. If X is discrete (finitely or countably infinitely valued), and N ∈ L2(Ω)whose distribution is absolutely continuous with respect to Lebesgue measure, then

D(X,N) = 0. (4.64)

In particular,D(X) = 0. (4.65)

Proof. Since constants have zero MMSE dimension, D(X) = D(X|X) = 0, in viewof Theorem 21. The more general result in (4.64) is proved in Appendix B.4.

Remark 6. Theorem 22 implies that mmse(X,N, snr) = o(

1snr

)as snr → ∞. As

observed in Example 4, MMSE can decay much faster than polynomially. Supposethe alphabet of X, denoted by A = xi : i ∈ N, has no accumulation point. Then

dmin , infi 6=j|xi − xj| > 0. (4.66)

If N is almost surely bounded, say |N | ≤ A, then mmse(X,N, snr) = 0 for all snr >(2Admin

)2

. On the other hand, if X is almost surely bounded, say |X| ≤ A, then

mmse(X, snr) decays exponentially: since the error probability, denoted by pe(snr), of

50

Page 62: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

a MAP estimator for X based on√snrX + NG is O

(Q(√

snr2dmin

)), where Q(t) =∫∞

tϕ(x)dx and ϕ is the standard normal density

ϕ(x) =1√2π

e−x2

2 . (4.67)

Hence

mmse(snr) ≤ A2pe(snr) = O

(1√snr

e−snr8d2min

). (4.68)

If the input alphabet has accumulation points, it is possible that the MMSE decayspolynomially. For example, when X takes values on

0, 1

2, 1

4, 1

8, . . .

and N is uniform

distributed on [0, 1], it can be shown that mmse(X,N, snr) = Θ(snr−2).

4.5.3 Absolutely continuous input

Theorem 23. Suppose the density of N ∈ L2(Ω) is bounded and is such that forsome α > 1,

fN(u) = O(|u|−α

), (4.69)

as |u| → ∞. ThenD(X,N) = 1 (4.70)

holds for all X with an absolutely continuous distribution with respect to Lebesguemeasure. In particular,

D(X) = 1, (4.71)

i.e.,

mmse(X, snr) =1

snr+ o

(1

snr

). (4.72)

Proof. Appendix B.5.

If the density of X is sufficiently regular, then (4.70) holds for all noise distribu-tions:

Theorem 24. Suppose the density of X is continuous and bounded. Then

D(X,N) = 1 (4.73)

holds for all (not necessarily absolutely continuous) N ∈ L2(Ω).

Proof. Appendix B.5.

In view of Theorem 24 and (4.30), we conclude that the linear estimator is dimen-sionally optimal for estimating absolutely continuous random variables contaminatedby additive Gaussian noise, in the sense that it achieves the input MMSE dimension.

51

Page 63: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

A refinement of Theorem 24 entails the second-order expansion for mmse(X, snr)for absolutely continuous input X. This involves the Fisher information of X. Sup-pose J(X) <∞. Then

J(√snrX +NG) ≤

∫ϕ(z)J(

√snrX + z)dz (4.74)

=

∫ϕ(z)J(

√snrX)dz (4.75)

=J(X)

snr<∞, (4.76)

where (4.74) and (4.75) follow from the convexity and translation invariance of Fisherinformation respectively. In view of (4.35), we have

mmse(X, snr) =1

snr+O

(1

snr2

), (4.77)

which improves (4.72) slightly.Under stronger regularity conditions the second-order term can be determined

exactly. A result of Prelov and van der Meulen [75] states the following: if J(X) <∞and the density of X satisfies certain regularity conditions [75, (3) – (7)], then

J(X + εNG) = J(X) +O (ε). (4.78)

Therefore in view of (4.34), we have

mmse(X, snr) =1

snr− J(X)

snr2+ o

(1

snr2

). (4.79)

This result can be understood as follows: Stam’s inequality [78] implies that

1

J(√snrX +NG)

≥ 1

J(√snrX)

+1

J(NG)(4.80)

=snr

J(X)+ 1. (4.81)

Using (4.34), we have

mmse(X, snr) ≥ 1

J(X) + snr(4.82)

=1

snr− J(X)

snr2+ o

(1

snr2

). (4.83)

Inequality (4.82) is also known as the Bayesian Cramer-Rao bound (or the Van Treesinequality) [79, pp. 72–73], [80, Corollary 2.3]. In view of (4.79), we see that (4.82)is asymptotically tight for sufficiently regular densities of X.

52

Page 64: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Instead of using the asymptotic expansion of Fisher information and Stam’s in-equality, we can show that (4.79) holds for a much broader class of densities of X andnon-Gaussian noise:

Theorem 25. Suppose X ∈ L3(Ω) with bounded density fX whose first two deriva-tives are also bounded and J(X) <∞. Furthermore, assume fX satisfies the followingregularity condition [81, Equation (2.5.16)]:∫

Rf ′′X(x)dx = 0. (4.84)

Then for any N ∈ L2(Ω),

mmse(X,N, snr) =varN

snr− J(X)

(varN

snr

)2

+ o

(1

snr2

). (4.85)

Proof. Appendix B.5.

Combined with (4.83), Theorem 25 implies that the Bayesian Cramer-Rao bound

mmse(X,N, snr) ≥ 1

J(X) + snrJ(N)(4.86)

=1

snr J(N)− J(X)

snr2 J2(N)+ o

(1

snr2

)(4.87)

is asymptotically tight if and only if the noise is Gaussian.Equation (4.85) reveals a new operational role of J(X). The regularity conditions

imposed in Theorem 25 are much weaker and easier to check than those in [75];however, (4.79) is slightly stronger than (4.85) because the o(snr−2) term in (4.79) is

in fact O(snr−52 ), as a result of (4.78).

It is interesting to compare (4.85) to results on asymptotic Bayesian risk in thelarge sample limit [70, Theorem 5.1a]:

• When N is Gaussian, recalling (4.22), we have

mmse(X|Y n) = mmse(X,n) (4.88)

=1

n− J(X)

n2+ o

(1

n2

). (4.89)

This agrees with the results in [70] but our proof requires less stringent condi-tions on the density of X.

• When N is non-Gaussian, under the regularity conditions in [70, Theorem 5.1a],we have

mmse(X|Y n) =1

nJ(N)− J(X)

n2 J2(N)+ o

(1

n2

), (4.90)

53

Page 65: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

which agrees with the Bayesian Cramer-Rao bound in the product case. Onthe other hand, by (4.85) we have

mmse(X|Y ) = mmse(X,n) (4.91)

=varN

n− J(X)

(varN

n

)2

+ o

(1

n2

). (4.92)

whose first-order term is inferior to (4.90), due to the Cramer-Rao boundvarN ≥ 1

J(N). This agrees with the fact that the sample mean is asymptotically

suboptimal for non-Gaussian noise, and the suboptimality is characterized bythe gap in the Cramer-Rao inequality.

To conclude the discussion of absolutely continuous inputs, we give an examplewhere (4.85) fails:

Example 2. Consider X and N uniformly distributed on [0, 1]. It is shown in Ap-pendix B.2 that

mmse(X,N, snr) = varN

(1

snr− 1

2snr32

), snr ≥ 4. (4.93)

Note that (4.85) does not hold because X does not have a differentiable density, henceJ(X) =∞. This example illustrates that the o

(1snr

)term in (4.72) is not necessarily

O(

1snr2

).

4.5.4 Mixed distribution

Next we present results for general mixed distributions, which are direct conse-quences of Theorem 21. The following result asserts that MMSE dimensions are affinefunctionals, a fundamental property shared by Renyi information dimension Theorem5.

Theorem 26.

D(∑

αiµi

)=∑

αiD(µi), (4.94)

D(∑

αiµi

)=∑

αiD(µi). (4.95)

where αi : i ∈ N is a probability mass function and each µi is a probability measure.

Another application of Theorem 21 is to determine the MMSE dimension of inputswith mixed distributions, which are frequently used in statistical models of sparsesignals [5, 4, 7, 82]. According to Theorem 21, knowing the support does not decreasethe MMSE dimension.

54

Page 66: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Corollary 1. Let X = UZ where U is independent of Z, taking values in 0, 1 withP U = 1 = ρ. Then

D(X) = ρD(Z), (4.96)

D(X) = ρD(Z). (4.97)

Obtained by combining Theorems 22, 23 and 26, the next result solves the MMSEdimension of discrete-continuous mixtures completely. Together with Theorem 20, italso provides an alternative proof of Renyi’s theorem on information dimension formixtures (Theorem 4) via MMSE.

Theorem 27. Let X have a discrete-continuous mixed distribution as defined in(1.4). Then its MMSE dimension equals the weight of the absolutely continuous part,i.e.,

D(X) = ρ. (4.98)

The above results also extend to non-Gaussian noise or general channels whichsatisfy the condition in Remark 33 at the end of Appendix B.3.

To conclude this subsection, we illustrate Theorem 27 by the following examples:

Example 3 (Continuous input). If X ∼ N (0, 1), then

mmse(X, snr) =1

snr + 1(4.99)

and (4.71) holds.

Example 4 (Discrete input). If X is equiprobable on −1, 1, then by (4.68) or [83,Theorem 3],

mmse(X, snr) = O

(1√snr

e−snr2

)(4.100)

and (4.65) holds.

Example 5 (Mixed input). Let N be uniformly distributed in [0, 1], and let X bedistributed according to an equal mixture of a mass at 0 and a uniform distributionon [0, 1]. Then

mmse(X,N, snr)

varN

=3

2− 3

4√snr

+1

snr− 1

4snr32

− 3√snr

2log

1 +√snr√

snr(4.101)

=1

2snr+

1

4snr32

+ o

(1

snr32

), (4.102)

which implies D(X,N) = 12

and verifies Theorem 27 for non-Gaussian noise.

55

Page 67: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

4.5.5 Singularly continuous distribution

In this subsection we consider atomless input distributions mutually singular withrespect to Lebesgue measure. There are two new phenomena regarding MMSE di-mensions of singular inputs. First, the lower and upper MMSE dimensions D(X,N)and D(X,N) depend on the noise distribution, even if the noise is restricted to beabsolutely continuous. Second, the MMSE dimension of a singular input need notexist. For an important class of self-similar singular distributions (e.g., the Cantordistribution), the function snr 7→ snr mmse(X, snr) oscillates between the lower andupper dimension periodically in log snr (i.e., in dB). This periodicity arises from theself-similarity of the input, and the period can be determined exactly. Unlike thelower and upper dimension, the period does not depend on the noise distribution.

We focus on a special class of inputs with self-similar distributions [46, p.36]:inputs with i.i.d. digits. Consider X ∈ [0, 1] a.s. For an integer M ≥ 2, the M -aryexpansion of X is defined in (2.47). Assume that the sequence (X)j : j ∈ N isi.i.d. with common distribution P supported on 0, . . . ,M − 1. According to (2.50),the information dimension of X exists and is given by the normalized entropy rate ofthe digits:

d(X) =H(P )

logM. (4.103)

For example, if X is Cantor-distributed, then the ternary expansion of X consists ofi.i.d. digits, and for each j,

P (X)j = 0 = P (X)j = 2 = 1/2. (4.104)

By (4.103), the information dimension of the Cantor distribution is log3 2. The nexttheorem shows that for such X the scaling constant of MMSE oscillates periodically.

Theorem 28. Let X ∈ [0, 1] a.s., whose M-ary expansion defined in (2.47) consistsof i.i.d. digits with common distribution P . Then for any N ∈ L2(Ω), there exists a2 logM-periodic function4 ΦX,N : R→ [0, 1], such that as snr→∞,

mmse(X,N, snr) =varN

snrΦX,N(log snr) + o

(1

snr

). (4.105)

The lower and upper MMSE dimension of (X,N) are given by

D(X,N) = lim supb→∞

ΦX,N(b) = sup0≤b≤2 logM

ΦX,N(b), (4.106)

D(X,N) = lim infb→∞

ΦX,N(b) = inf0≤b≤2 logM

ΦX,N(b). (4.107)

4Let T > 0. We say a function f : R → R is T -periodic if f(x) = f(x+ T ) for all x ∈ R, and Tis called a period of f ([84, p. 183] or [85, Section 3.7]). This differs from the definition of the leastperiod which is the infimum of all periods of f .

56

Page 68: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Moreover, when N = NG is Gaussian, the average of ΦX,NGover one period coincides

with the information dimension of X:

1

2 logM

∫ 2 logM

0

ΦX,NG(b) db = d(X) =

H(P )

logM. (4.108)

Proof. Appendix B.6.

Remark 7. Trivial examples of Theorem 28 include ΦX,N ≡ 0 (X = 0 or 1 a.s.) andΦX,N ≡ 1 (X is uniformly distributed on [0, 1]).

Theorem 28 shows that in the high-SNR regime, the function snr ·mmse(X,N, snr)is periodic in snr (dB) with period 20 logM dB. Plots are given in Fig. 4.1 – 4.2. Al-though this reveals the oscillatory nature of mmse(X,N, snr), we do not have a generalformula to compute the lower (or upper) MMSE dimension of (X,N). However, whenthe noise is Gaussian, Theorem 20 provides a sandwich bound in terms of the infor-mation dimension of X, which is reconfirmed by combining (4.106) – (4.108).

Remark 8 (Binary-valued noise). One interesting case for which we are able tocompute the lower MMSE dimension corresponds to binary-valued noise, with whichall singular inputs (including discrete distributions) have zero lower MMSE dimension(see Appendix B.7 for a proof). This phenomenon can be explained by the followingfact about negligible sets: a set with zero Lebesgue measure can be translated by anarbitrarily small amount to be disjoint from itself. Therefore, if an input is supportedon a set with zero Lebesgue measure, we can perform a binary hypothesis test basedon its noisy version, which admits a decision rule with zero error probability whenSNR is large enough.

4.6 Numerical Results

4.6.1 Approximation by discrete inputs

Due to the difficulty of computing conditional expectation and estimation errorin closed form, we capitalize on the regularity of the MMSE functional by computingthe MMSE of successively finer discretizations of a given X. For an integer m weuniformly quantize X to [X]m. Then we numerically compute mmse([X]m, snr) forfixed snr. By the weak continuity of PX 7→ mmse(X, snr) [86, Corollary 3], as thequantization level m grows, mmse([X]m, snr) converges to mmse(X, snr); however, onecaveat is that to obtain the value of MMSE within a given accuracy, the quantizationlevel needs to grow as snr grows (roughly as log snr) so that the quantization error ismuch smaller than the noise. Lastly, due to the following result, to obtain the MMSEdimension it is sufficient to only consider integer values of snr.

Lemma 13. It is sufficient to restrict to integer values of snr when calculatingD(X|U) and D(X|U) in (4.10) and (4.11) respectively.

57

Page 69: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Proof. For any snr > 0, there exists n ∈ Z+, such that n ≤ snr < n + 1. Since thefunction mmse(snr) = mmse(X, snr|U) is monotonically decreasing, we have

mmse(n+ 1) ≤ mmse(snr) ≤ mmse(n), (4.109)

hence

n mmse(n+ 1) ≤ snr mmse(snr) (4.110)

≤ n mmse(n) + mmse(n). (4.111)

Using that limn→∞mmse(n) = 0, the claim of Lemma 13 follows.

4.6.2 Self-similar input distribution

We numerically calculate the MMSE for Cantor distributed X and Gaussian noiseby choosing [X]m = b3mXc3−m. By (4.104), [X]m is equiprobable on the set Am,which has cardinality 2m and consists of all 3-adic fractionals whose ternary digitsare either 0 or 2. According to Theorem 28, in the high-SNR regime, snr mmse(X, snr)oscillates periodically in log snr with period 2 log 3, as plotted in Fig. 4.1. The lowerand upper MMSE dimensions of the Cantor distribution turn out to be (to six deci-mals):

D(X) = 0.621102, (4.112)

D(X) = 0.640861. (4.113)

Note that the information dimension d(X) = log3 2 = 0.630930 is sandwiched betweenD(X) and D(X), according to Theorem 20. From this and other numerical evidenceit is tempting to conjecture that

d(X) =D(X) + D(X)

2(4.114)

when the noise is Gaussian.It should be pointed out that the sandwich bounds in (4.46) need not hold when

N is not Gaussian. For example, in Fig. 4.2 where snr · mmse(X,N, snr) is plottedagainst log3 snr for X Cantor distributed and N uniformly distributed in [0, 1], it isevident that d(X) = log3 2 > D(X,N) = 0.5741.

4.6.3 Non-Gaussian noise

Via the input-noise duality in (4.44), studying high-SNR asymptotics provides in-sights into the behavior of mmse(X,N, snr) for non-Gaussian noise N . Combining var-ious results from Sections 4.5.2, 4.5.3 and 4.5.5, we observe that mmse(X,N, snr) canbehave very irregularly in general, unlike in Gaussian channels where mmse(X, snr)is strictly decreasing in snr. To illustrate this, we consider the case where standardGaussian input is contaminated by various additive noises. For all N , it is evident

58

Page 70: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

2 4 6 8 10 12

0.58

0.6

0.62

0.64

d(X)

D(X)

D(X)

log3 snr

snrmmse(X

,snr)

Figure 4.1: Plot of snr mmse(X, snr) against log3 snr, where X has a ternary Cantordistribution.

that mmse(X,N, 0) = varX = 1. Due to Theorem 24, the MMSE vanishes as varNsnr

regardless of the noise. The behavior of MMSE associated with Gaussian, Bernoulliand Cantor distributed noises is (Fig. 4.3a):

• For standard Gaussian N , mmse(X, snr) = 11+snr

is continuous at snr = 0 and

decreases monotonically according to 1snr

. Recall that [86, Section III] this mono-tonicity is due to the MMSE data-processing inequality [87] and the stabilityof Gaussian distribution.

• For equiprobable Bernoulli N , mmse(X,N, snr) has a discontinuity of first kindat snr = 0.

varX = mmse(X,N, 0) > limsnr↓0

mmse(X,N, snr) = 0. (4.115)

As snr → 0, the MMSE vanishes according to O(

1√snr

e−1

2snr

), in view of (4.44)

and (4.68), and since it also vanishes as snr → ∞, it is not monotonic withsnr > 0.

• For Cantor distributed N , mmse(X,N, snr) has a discontinuity of second kindat snr = 0. According to Theorem 28, as snr→ 0, MMSE oscillates relentlesslyand does not have a limit (See the zoom-in plot in Fig. 4.3b).

59

Page 71: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

2 4 6 8 10 12 14

0.54

0.55

0.56

0.57

D(X,N)

D(X,N)

log3 snr

snrmmse(X

,N,snr)

Figure 4.2: Plot of snr ·mmse(X,N, snr)/varN against log3 snr, where X has a ternaryCantor distribution and N is uniformly distributed in [0, 1].

4.7 Discussions

Through the high-SNR asymptotics of MMSE in Gaussian channels, we defineda new information measure called the MMSE dimension. Although stemming fromestimation-theoretic principles, MMSE dimension shares several important featureswith Renyi’s information dimension. By Theorem 21 and Theorem 5, they are bothaffine functionals. According to Theorem 20, information dimension is sandwichedbetween the lower and upper MMSE dimensions. For distributions with no singularcomponents, they coincide to be the weight of the continuous part of the distribution.The high-SNR scaling law of mutual information and MMSE in Gaussian channelsare governed by the information dimension and the MMSE dimension respectively.In Chapter 5, we show that the information dimension plays a pivotal role in almostlossless analog compression, an information-theoretic model for noiseless compressedsensing. In fact we show in Chapter 6 that the MMSE dimension is closely related tothe fundamental limit of noisy compressed sensing and the optimal phase transitionof noise sensitivity (see Theorem 48).

Characterizing the high-SNR suboptimality of linear estimation, (4.30) providesan alternative definition of MMSE dimension, which enables us to extend our resultsto random vectors or processes. In these more general setups, it is interesting to in-vestigate how the causality constraint of the estimator affects the high-SNR behavior

60

Page 72: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

of the optimal estimation error. Another direction of generalization is to study thehigh-SNR asymptotics of MMSE with a mismatched model in the setup of [88] or[89].

61

Page 73: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

0 2 4 6 8 100

0.2

0.4

0.6

0.8

1

Bernoulli

Cantor

Gaussian

snr

mmse(X

,N,snr)

(a) Behavior of MMSE under various noise distributions.

0 0.1 0.2 0.3 0.4

0.6

0.61

0.62

0.63

0.64 D(N)

D(N)

snr

mmse(X

,N,snr)

(b) Low-SNR plot for Cantor distributed N .

Figure 4.3: Plot of mmse(X,N, snr) against snr, where X is standard Gaussian andN is standard Gaussian, equiprobable Bernoulli or Cantor distributed (normalized tohave unit variance).

62

Page 74: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Chapter 5

Noiseless compressed sensing

In this chapter we present the theory of almost lossless analog compression, aShannon theoretic framework for noiseless compressed sensing. It also extends theclassical almost lossless source coding theory to uncountable source and code alpha-bets. Section 5.2 introduces a unified framework for lossless analog compression. Weprove coding theorems under various regularity conditions on the compressor or de-compressor, which, in the memoryless case, involves Renyi information dimension ofthe source. These results provide new operational characterization of the informa-tion dimension in Shannon theory. Proofs are deferred till Sections 5.3 – 5.6. Sometechnical lemmas are proved in Appendix C. The material in this chapter has beenpresented in part in [27, 5].

5.1 Motivations

5.1.1 Noiseless compressed sensing versus data compression

The “bit” is the universal currency in lossless source coding theory [90], whereShannon entropy is the fundamental limit of compression rate for discrete memorylesssources. Sources are modeled by stochastic processes and redundancy is exploited asprobability is concentrated on a set of exponentially small cardinality as blocklengthgrows. Therefore by encoding this subset, data compression is achieved if we toleratea positive, albeit arbitrarily small, block error probability.

Compressed sensing, on the other hand, employs linear transformations to encodereal vectors by real-valued measurements. The formulation of noiseless compressedsensing is reminiscent of the traditional lossless data compression in the followingsense:

• Sources are sparse in the sense that each vector is supported on a set muchsmaller than the blocklength. This type of redundancy in terms of sparsityis exploited to achieve effective compression by taking fewer number of linearmeasurements.

• In contrast to lossy data compression, block error probability, instead of distor-tion, is the performance benchmark.

63

Page 75: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

• The central problem is to determine how many compressed measurements aresufficient/necessary for recovery with vanishing block error probability as block-length tends to infinity.

• Random coding is employed to show the existence of “good” linear encoders.For instance, when the random sensing matrices are drawn from certain en-sembles (e.g. standard Gaussian), the restricted isometry property (RIP) issatisfied with overwhelming probability and guarantees exact recovery.

On the other hand, there are also significantly different ingredients in compressedsensing in comparison with information theoretic setups:

• Sources are not modeled probabilistically, and the fundamental limits are on aworst-case basis rather than on average. Moreover, block error probability iswith respect to the distribution of the encoding random matrices.

• Real-valued sparse vectors are encoded by real numbers instead of bits.• The encoder is confined to be linear while generally in information-theoretical

problems such as lossless source coding we have the freedom to choose the bestpossible coding scheme.

Departing from the conventional compressed sensing setup, in this chapter we studyfundamental limits of lossless source coding for real-valued memoryless sources withinan information theoretic setup:

• Sources are modeled by random processes. As explained in Section 1.3, thismethod is more flexible to describe source redundancy which encompasses, butis not limited to, sparsity. For example, a mixed discrete-continuous distribu-tion of the form (1.2) is suitable for characterizing linearly sparse vectors.

• Block error probability is evaluated by averaging with respect to the source.• While linear compression plays an important role in our development, our treat-

ment encompasses weaker regularity conditions.

5.1.2 Regularities of coding schemes

Discrete sources have been the sole object in lossless data compression theory. Thereason is at least twofold. First of all, non-discrete sources have infinite entropy, whichimplies that representation with arbitrarily small block error probability requiresarbitrarily large rate. On the other hand, even if we consider encoding analog sourcesby real numbers, the result is still trivial, as R and Rn have the same cardinality.Therefore a single real number is capable of representing a real vector losslessly,yielding a universal compression scheme for any analog source with zero rate andzero error probability.

However, it is worth pointing out that the compression method proposed above isnot robust because the bijection between R and Rn is highly irregular. In fact, neitherthe encoder nor the decoder can be continuous [91, Exercise 6(c), p. 385]. Therefore,using such a compression scheme, a slight change of source realization could result in adrastic change of encoder output and the decoder could suffer from a large distortiondue to a tiny perturbation of the codeword. This disadvantage motivates us to studyhow to compress not only losslessly but also gracefully on Euclidean spaces. In fact

64

Page 76: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

some authors have also noticed the importance of regularity in data compression.In [92] Montanari and Mossel observed that the optimal data compression schemeoften exhibits the following inconvenience: codewords tend to depend chaotically onthe data; hence changing a single source symbol leads to a radical change in thecodeword. In [92], a source code is said to be smooth (resp. robust) if the encoder(resp. decoder) is Lipschitz (see Definition 10) with respect to the Hamming distance.The fundamental limits of smooth lossless compression are analyzed in [92] for binarysources via sparse graph codes. In this thesis we focus on real-valued sources withgeneral distributions. Introducing a topological structure makes the nature of theproblem quite different from traditional formulations in the discrete world, and callsfor machinery from dimension theory and geometric measure theory.

The regularity constraints on the coding procedures considered in this chapterare the linearity of compressors and the robustness of decompressors, the latter ofwhich is motivated by the great importance of robust reconstruction in compressedsensing (e.g., [93, 94, 95, 96]), as noise resilience is an indispensable property fordecompressing sparse signals from real-valued measurements. More specifically, thefollowing two robustness constraints are considered:

Definition 9 ((L,∆)-stable). Let (U, dU) and (V, dV ) be metric spaces and T ⊂ U .g : U → V is called (L,∆)-stable on T if for all x, y ∈ T

dU(x, y) ≤ ∆⇒ dV (g(x), g(y)) ≤ L∆. (5.1)

And we say g is ∆-stable if g is (1,∆)-stable.

Definition 10 (Holder and Lipschitz continuity). Let (U, dU) and (V, dV ) be metricspaces. A function g : U → V is called (L, γ)-Holder continuous if there existsL, γ ≥ 0 such that for any x, y ∈ U ,

dV (g(x), g(y)) ≤ LdU(x, y)γ. (5.2)

g is called L-Lipschitz if g is (L, 1)-Holder continuous. g is simply called Lipschitz(resp. β-Holder continuous) if g is L-Lipschitz (resp. (L, β)-Holder continuous) forsome L ≥ 0. The Lipschitz constant of g is defined as

Lip(g) , supx 6=y

dV (g(x), g(y))

dU(x, y). (5.3)

Remark 9. The ∆-stability is a much weaker notion than Lipschitz continuity, be-cause it is easy to see that g is L-Lipschitz if and only if g is (L,∆)-stable for every∆ > 0.

To appreciate the relevance of Definitions 9 and 10 to compressed sensing, let usconsider the following robustness result [96, Theorem 3.2]:

Theorem 29. Let y = Ax0 + e, where A ∈ Rk×n, x0 ∈ Rn and e, y ∈ Rk with‖x0‖0 = S and ‖e‖2 ≤ ∆ ∈ R+. Suppose A satisfies δS ≤ 0.307, where δS is the

65

Page 77: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

S-restricted isometry constant of matrix A, defined as the smallest δ > 0 such that

(1− δ) ‖u‖22 ≤ ‖Au‖

22 ≤ (1 + δ) ‖u‖2

2 (5.4)

for all u ∈ Rn such that ‖u‖0 ≤ S. Then the following `2-constrained `1-minimizationproblem

min ‖x‖1

s.t. ‖Ax− y‖2 ≤ ∆.(5.5)

has a unique solution, denoted by x∆, which satisfies

‖x∆ − x0‖2 ≤∆

0.307− δS. (5.6)

In particular, when ∆ = 0, x0 is the unique solution to the following `1-minimizationproblem

min ‖x‖1

s.t. Ax = Ax0.(5.7)

Theorem 29 states that if the sensing matrix A has a sufficiently small restrictedisometry constant, the constrained `1-minimization decoder x∆ (5.5) yields a recon-struction error whose `2-norm is upper bounded proportionally to the `2-norm of thenoise. This is equivalent to the (L,∆)-stability of the decoder with L = 1

0.307−δS .The drawback of this result is that the decoder depends on the noise magnitude∆. To handle the potential case of unbounded noise (e.g., drawn from a Gaussiandistribution), it is desirable to have a robust decoder independent of ∆, which, inview of Remark 9, amounts to the Lipschitz continuity of the decoder. In fact, theunconstrained `1-minimization decoder in (5.7) is Lipschitz continuous. To see this,observe that (5.7) can be equivalently formulated as a linear programming problem.It is known that the optimal solution to any linear programming problem is Lipschitzcontinuous with respect to the parameters [97]. However, it is likely that the Lipschitzconstant blows up as the ambient dimension grows. As we will show later in Theo-rem 32, for discrete-continuous mixtures that are most relevant for compressed sensingproblems, it is possible to construct decoders with bounded Lipschitz constants.

5.2 Definitions and Main results

5.2.1 Lossless data compression

Let the source Xi : i ∈ N be a stochastic process on (X N,F⊗N), with X denotingthe source alphabet and F a σ-algebra over X . Let (Y ,G) be a measurable space,where Y is called the code alphabet. The main objective of lossless data compressionis to find efficient representations for source realizations xn ∈ X n by yk ∈ Yk.

Definition 11. A (n, k)-code for Xi : i ∈ N over the code space (Y ,G) is a pair ofmappings:

66

Page 78: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

1. Encoder: fn : X n → Yk that is measurable relative to Fn and Gk.

2. Decoder: gn : Yk → X n that is measurable relative to Gk and Fn.

The block error probability is P gn(fn(Xn)) 6= Xn.

The fundamental limit of lossless source coding is:

Definition 12 (Lossless Data Compression). Let Xi : i ∈ N be a stochastic processon (X N,F⊗N). Define r(ε) to be the infimum of r > 0 such that there exists a sequenceof (n, brnc)-codes over the code space (Y ,G), such that

P gn(fn(Xn)) 6= Xn ≤ ε (5.8)

for all sufficiently large n.

According to the classical discrete almost-lossless source coding theorem, if X iscountable and Y is finite, the minimum achievable rate for any i.i.d. process withdistribution P is

r(ε) =

log |X |log |Y| , ε = 0H(P )log |Y| , 0 < ε < 1,

0, ε = 1.

(5.9)

Using codes over an infinite alphabet, any discrete source can be compressed with zerorate and zero block error probability. In other words, if both X and Y are countablyinfinite, then for all 0 ≤ ε ≤ 1,

r(ε) = 0 (5.10)

for any random process.

5.2.2 Lossless analog compression with regularity conditions

In this subsection we consider the problem of encoding analog sources with analogsymbols, that is, (X ,F) = (R,BR) and (Y ,G) = (R,BR) or ([0, 1],B[0,1]) if boundedencoders are required, where B[0,1] denotes the Borel σ-algebra. As in the countablyinfinite case, zero rate is achievable even for zero block error probability, because thecardinality of Rn is the same for any n [98]. This conclusion holds even if we require theencoder/decoder to be Borel measurable, because according to Kuratowski’s theorem[99, Remark (i), p. 451] every uncountable standard Borel space is isomorphic1 to([0, 1],B[0,1]). Therefore a single real number has the capability of encoding a realvector, or even a real sequence, with a coding scheme that is both universal anddeterministic.

However, the rich topological structure of Euclidean spaces enables us to probethe problem further. If we seek the fundamental limits of not only lossless codingbut “graceful” lossless coding, the result is not trivial anymore. In this spirit, our

1Two measurable spaces are isomorphic if there exists a measurable bijection whose inverse isalso measurable.

67

Page 79: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

various definitions share the basic information-theoretic setup where a random vectoris encoded with a function fn : Rn → RbRnc and decoded with gn : RbRnc → Rn withR ≤ 1 such that fn and gn satisfy certain regularity conditions and the probability ofincorrect reconstruction vanishes as n→∞.

Regularity conditions of encoders and decoders are imposed for the sake of bothless complexity and more robustness. For example, although a surjection g from [0, 1]to Rn is capable of lossless encoding, its irregularity requires specifying uncountablymany real numbers to determine this mapping. Moreover, smoothness of the encodingand decoding operation is crucial to guarantee noise resilience of the coding scheme.

Definition 13. Let Xi : i ∈ N be a stochastic process on (RN,B⊗NR ). Define theminimum ε-achievable rate to be the infimum of R > 0 such that there exists asequence of (n, bRnc)-codes (fn, gn) , such that

P gn(fn(Xn)) 6= Xn ≤ ε (5.11)

for all sufficiently large n, and the encoder fn and decoder gn are constrained accordingto Table 5.1. Except for linear encoding where Y = R, it is assumed that Y = [0, 1].

Table 5.1: Regularity conditions of encoder/decoders and corresponding minimumε-achievable rates.

Encoder Decoder Minimum ε-achievable rate

Continuous Continuous R0(ε)

Linear Borel R∗(ε)

Borel Lipschitz R(ε)

Linear Lipschitz R(ε)

Borel ∆-stable R(ε,∆)

Remark 10. Let Sn = xn : gn(fn(xn)) = xn. Then fn is the inverse of gn on Sn.If gn is L-Lipschitz continuous, fn satisfies

‖fn(x)− fn(y)‖ ≥ 1

L‖x− y‖ , (5.12)

which implies the injectivity of fn, a necessary condition for decodability. Further-more, not only does fn assign different codewords to different source symbols, it alsokeeps them sufficiently separated proportionally to their distance. In other words,the encoder fn respects the metric structure of the source alphabet.

Remark 11. In our framework a stable or Lipschitz continuous coding scheme im-plies robustness with respect to noise added at the input of the decompressor, whichcould result from quantization, finite wordlength or other inaccuracies. For example,suppose that the encoder output yk = fn(xn) is quantized by a q-bit uniform quan-tizer, resulting in yk. Using a 2−q-stable coding strategy (fn, gn), we can decode as

68

Page 80: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

follows: denote the following nonempty set

D(yk) =zk ∈ Cn :

∥∥zk − yk∥∥∞ ≤ 2−q. (5.13)

where Cn = fn(xn) : xn ∈ Rn, gn(fn(xn)) = xn. Pick any zk in D(yk) and outputxn = gn(zk) as the reconstruction. By the stability of gn and the triangle inequality,we have

‖xn − xn‖∞ ≤ 2−(q−1), (5.14)

i.e., each component in the decoder output will suffer at most twice the inaccuracyof the decoder input. Similarly, an L-Lipschitz decoder with respect to the `∞ normincurs an error no more than L2−q.

5.2.3 Coding theorems

The following general result holds for any input process:

Theorem 30. For any input process and any 0 ≤ ε ≤ 1,

R∗(ε) ≤ RB(ε) ≤ R(ε) ≤ R(ε), (5.15)

where RB(ε) is defined in (3.15).

The right inequality in (5.15) follows from the definitions, since we have R(ε) ≥maxR∗(ε),R(ε). The left and middle inequalities, proved in Sections 5.3.3 and5.4.2 respectively, bridge the lossless Minkowski dimension formulation introduced inDefinition 7 with the lossless analog compression framework in Definition 13. Conse-quently, upper and lower bounds on RB(ε) proved in Section 3.2 provide achievabilitybounds for lossless linear encoding and converse bounds for Lipschitz decoding, re-spectively. Moreover, we have the following surprising result

R∗(ε) ≤ R(ε), (5.16)

which states that robust reconstruction is always harder to pursue than linear com-pression, regardless of the input statistics.

Next we proceed to give results for each of the minimum ε-achievable rates intro-duced in Definition 13 for memoryless sources. First we consider the case where theencoder is restricted to be linear.

Theorem 31 (Linear encoding: general achievability). Suppose that the source ismemoryless. Then

R∗(ε) ≤ d(X) (5.17)

for all 0 < ε < 1, where d(X) is defined in (2.42). Moreover,

1. For all linear encoders (except possibly those in a set of zero Lebesgue measureon the space of real matrices), block error probability ε is achievable.

69

Page 81: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

2. The decoder can be chosen to be β-Holder continuous for all 0 < β < 1− d(X)R

,

where R > d(X) is the compression rate.

Proof. See Section 5.3.3.

For discrete-continuous mixtures, the following theorem states that linear encodersand Lipschitz decoders can be realized simultaneously with bounded Lipschitz con-stants. This result also serves as the foundation for our construction in noisy com-pressed sensing (see Theorem 49).

Theorem 32 (Linear encoding: discrete-continuous mixture). Suppose that thesource is memoryless with a discrete-continuous mixed distribution. Then

R∗(ε) = d(X) (5.18)

for all 0 < ε < 1. Moreover, if the discrete part has finite entropy, then for anyrate R > d(X), the decompressor can be chosen to be Lipschitz continuous with re-spect to the `2-norm with an O(1) Lipschitz-constant that only depends on R− d(X).Consequently,

R(ε) = R∗(ε) = d(X) (5.19)

Proof. See Section 5.3.3.

Combining previous theorems yields the following tight result: for discrete-continuous memoryless mixtures of the form in (1.4) such that the discrete componenthas finite entropy, we have

R∗(ε) = R(ε) = RB(ε) = R(ε) = γ (5.20)

for all 0 < ε < 1.

Remark 12. The achievability proof of O(1) Lipschitz constants in Theorem 32works only with the `2 norm, because it depends crucially on the inner productstructure endowed by the `2 norm. The reasons is twofold: first, the Lipschitz constantof a linear mapping with respect to the `2 norm is given by its maximal singularvalue, whose behavior for random measurement matrices is well studied. Second,Kirszbraun’s theorem states that any Lipschitz mapping between two Hilbert spacescan be extended to the whole space with the same Lipschitz constant [100, Theorem1.31, p. 21]. This result fails for general Banach spaces, in particular, for Rn equippedwith any `p-norm (p 6= 2) [100, p. 20]. Although by equivalence of norms on finite-dimensional spaces, it is always possible to extend to a Lipschitz function with alarger Lipschitz constant, which, however, will possibly blow up as the dimensionincreases. Nevertheless, (5.25) shows that even if we allow a sequence of decompressorswith Lipschitz constants that diverges as n → ∞, the compression rate is still lowerbounded by d(X).

70

Page 82: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Remark 13. In the setup of Theorem 32, it is interesting to examine the robustnessof the decoder and its dependence on the measurement rate quantitatively. The proofof Theorem 32 (see (5.189)) shows that for the mixed input distribution (1.4) andany R > γ, the Lipschitz constant of the decoder can be upper bounded by

L =√R exp

1

R− γ

[(H(Pd) +R− γ)(1− γ) +

R

2h( γR

)]+

1

2

, (5.21)

where h(α) , α log 1α

+ (1 − α) log 11−α is the binary entropy function. Note that

(5.21) depends on the excessive rate δ , R− γ as follows: as δ ↓ 0,

• if H(Pd) > 0 (e.g., simple signals (1.3)), then L depends exponentially on 1δ;

• if H(Pd) = 0 (e.g., sparse signals (1.2)), then (5.21) simplifies to

L =√R exp

R

2δh( γR

)+

3

2− γ, (5.22)

which behaves as Θ(

1√δ

).

Remark 14. For the sparse signal model (1.2), the optimal threshold γ in (5.20)can be achieved by the `0-minimization decoder that seeks the sparsest solution com-patible with the linear measurements. This is because whenever the decoder used inthe proof of Theorem 32 (as described in Remark 16) succeeds, the `0-minimizationdecoder also succeeds. However, the `0-minimization decoder is both computationallyexpensive and unstable. The optimality of the `0-minimization decoder sparse signalshas also been observed in [10, Section IV-A1] based on replica heuristics.

Theorem 33 (Linear encoding: achievability for self-similar sources). Suppose thatthe source is memoryless with a self-similar distribution that satisfies the open setcondition. Then

R∗(ε) ≤ d(X) (5.23)

for all 0 < ε < 1.

Proof. See Section 5.3.3.

In Theorems 31 – 32 it has been shown that block error probability ε is achievablefor Lebesgue-a.e. linear encoder. Therefore, choosing any random matrix whosedistribution is absolutely continuous (e.g. i.i.d. Gaussian random matrices) satisfiesblock error probability ε almost surely.

Now, we drop the restriction that the encoder is linear, allowing very generalencoding rules. Let us first consider the case where both the encoder and decoder areconstrained to be continuous. It turns out that zero rate is achievable in this case.

Theorem 34 (Continuous encoder and decoder). For general sources,

R0(ε) = 0 (5.24)

for all 0 < ε ≤ 1.

71

Page 83: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Proof. Since (Rn,BRn) and ([0, 1],B[0,1]) are Borel isomorphic, there exist Borel mea-surable f : Rn → [0, 1] and g : [0, 1] → Rn, such that g = f−1. By Lusin’s theorem[101, Theorem 7.10], there exists a compact set K ⊂ Rn such that f restrictedon K is continuous and P Xn /∈ K < ε. Since K is compact and f is injectiveon K, f is a homeomorphism from K to T , f(K). Hence g : T → Rn is con-tinuous. Since both K and T are closed, by the Tietze extension theorem [91], fand g can be extended to continuous f ′ : Rn → [0, 1] and g′ : [0, 1] → Rn respec-tively. Using f ′ and g′ as the new encoder and decoder, the error probability satisfiesP g′(f ′(Xn)) 6= Xn ≤ P Xn /∈ K < ε.

Employing similar arguments as in the proof of Theorem 34, we see that imposingadditional continuity constraints on the encoder (resp. decoder) has almost no impacton the fundamental limit R(ε) (resp. R∗(ε)). This is because a continuous encoder(resp. decoder) can be obtained at the price of an arbitrarily small increase of errorprobability, which can be chosen to vanish as n grows.

Theorems 35 - 37 deal with Lipschitz decoding in Euclidean spaces.

Theorem 35 (Lipschitz decoding: General converse). Suppose that the source ismemoryless. If d(X) <∞, then

R(ε) ≥ d(X) (5.25)

for all 0 < ε < 1.

Proof. See Section 5.4.2.

Theorem 36 (Lipschitz decoding: Achievability for discrete/continuous mixture).Suppose that the source is memoryless with a discrete-continuous mixed distribution.Then

R(ε) = d(X) (5.26)

for all 0 < ε < 1.

Proof. See Section 5.4.3.

For sources with a singular distribution, in general there is no simple answerdue to their fractal nature. For an important class of singular measures, namely self-similar measures generated from i.i.d. digits (e.g. generalized Cantor distribution), theinformation dimension turns out to be the fundamental limit for lossless compressionwith Lipschitz decoder.

Theorem 37 (Lipschitz decoding: Achievability for self-similar measures). Supposethat the source is memoryless and bounded, and its M-ary expansion consists of in-dependent identically distributed digits. Then

R(ε) = d(X) (5.27)

for all 0 < ε < 1. Moreover, if the distribution of each bit is equiprobable on itssupport, then (5.27) holds for ε = 0

72

Page 84: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Proof. See Section 5.4.4.

Example 6. As an example, we consider the setup in Theorem 37 with M = 3 andP = p, 0, q, where p+q = 1. The associated invariant set is the middle third Cantorset C [18] and X is supported on C. The distribution of X, denoted by µ, is calledthe generalized Cantor distribution [35]. In the ternary expansion of X, each digit isindependent and takes value 0 and 2 with probability p and q respectively. Then byTheorem 37, for any 0 < ε < 1, R(ε) = h(p)

log 3. Furthermore, when p = 1/2, µ coincides

with the ‘uniform’ distribution on C, i.e., the standard Cantor distribution. Hence wehave a stronger result that R(0) = log3 2 ≈ 0.63, i.e., exact lossless compression canbe achieved with a Lipschitz continuous decompressor at the rate of the informationdimension.

We conclude this section by discussing the fundamental limit of stable decoding(respect to the `∞ norm), which is given by the following tight result.

Theorem 38 (Stable decoding). Let the underlying metric be the `∞ distance. Sup-pose that the source is memoryless. Then for all 0 < ε < 1,

lim sup∆↓0

R(ε,∆) = d(X), (5.28)

that is, the minimum ε-achievable rate such that for all sufficiently small ∆ thereexists a ∆-stable coding strategy is given by d(X).

Proof. See Section 5.6.

5.3 Lossless Linear Compression

In this section we analyze lossless compression with linear encoders, which are thebasic elements in compressed sensing. Capitalizing on the approach of Minkowski-dimension compression developed in Chapter 3, we obtain achievability results forlinear compression. For memoryless sources with a discrete-continuous mixed distri-bution, we also establish a converse (Theorem 32) which shows that the informationdimension is the fundamental limit of lossless linear encoding.

5.3.1 Minkowski-dimension compression and linear compres-sion

The following theorem establishes the relationship between Minkowski-dimensioncompression and linear data compression.

Theorem 39 (Linear encoding: general achievability). For general sources,

R∗(ε) ≤ RB(ε), (5.29)

Moreover,

73

Page 85: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

1. For all linear encoders (except possibly those in a set of zero Lebesgue measureon the space of real matrices), block error probability ε is achievable.

2. For all ε′ > ε and

0 < β < 1− RB(ε)

R, (5.30)

where R > RB(ε) is the compression rate, there exists a β-Holder continuousdecoder that achieves block error probability ε′.

Consequently, in view of Theorem 39, the results on RB(ε) for memoryless sourcesin Theorems 14 – 16 yield the achievability results in Theorems 31 – 33 respectively.Holder exponents of the decoder can be found by replacing RB(ε) in (5.30) by itsrespective upper bound.

Remark 15. For discrete-continuous sources, the achievability in Theorem 32 can beshown directly without invoking the general result in Theorem 39. See Remark 18.From the converse proof of Theorem 32, we see that effective compression can beachieved with linear encoders, i.e., R∗(ε) < 1, only if the source distribution is notabsolutely continuous with respect to Lebesgue measure.

Linear embedding of low dimensional subsets in Banach spaces was previouslystudied in [19, 21, 20] etc. in a non-probabilistic setting. For example, [21, Theorem1.1] showed that: for a subset S of a Banach space with dimBS < k/2, there existsa bounded linear function that embeds S into Rk. Here in a probabilistic setup, theembedding dimension can be improved by a factor of two, in view of the followingfinite-dimensional version of Theorem 39, whose proof is almost identical.

Theorem 40. Let Xn be a random vector with dimε

B(PXn) ≤ k. Let m > k. Then forLebesgue almost every A ∈ Rm×n, there exists a

(1− k

m

)-Holder continuous function

g : Rm → Rn, such that P g(AXn) 6= Xn ≥ 1− ε.

Remark 16. In Theorem 40, the decoder can be chosen as follows: by definition ofdim

ε

B, there exists U ⊂ Rn, such that dimB(U) ≤ k. Then if xn is the unique solutionto the linear equation Axn = yk in U , the decoder outputs g(yk) = xn; otherwiseg(yk) = 0.

Particularizing Theorem 40, we obtain a non-asymptotic result of lossless linearcompression for k-sparse vectors (i.e., with no more than k non-zero components),which is relevant to compressed sensing.

Corollary 2. Denote the collection of all k-sparse vectors in Rn by

Σk = xn ∈ Rn : ‖xn‖0 ≤ k. (5.31)

Let µ be a σ-finite Borel measure on Rn. Then given any l ≥ k+ 1, for Lebesgue-a.e.l×n real matrix H, there exists a Borel function g : Rl → Rn, such that g(Hxn) = xn

for µ-a.e. xn ∈ Σk. Moreover, when µ is finite, for any ε > 0 and 0 < β < 1− kl, there

exists a l×n matrix H∗ and g∗ : Rl → Rn, such that µxn ∈ Σk : g∗(H∗xn) 6= xn ≤ εand g∗ is β-Holder continuous.

74

Page 86: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Remark 17. The assumption that the measure µ is σ-finite is essential, becausethe validity of Corollary 2 hinges upon Fubini’s theorem, where σ-finiteness is anindispensable requirement. Consequently, if µ is the distribution of a k-sparse randomvector with uniformly chosen support and Gaussian distributed non-zero entries, weconclude that all k-sparse vectors can be linearly compressed except for a subsetof zero measure under µ. On the other hand, if µ is the counting measure on Σk,Corollary 2 no longer applies because that µ is not σ-finite. In fact, if l < 2k, nolinear encoder from Rn to Rl works for every k-sparse vector, even if no regularityconstraint is imposed on the decoder. This is because no l×n matrix acts injectivelyon Σk. To see this, define C − C = x− y : x, y ∈ C for C ⊂ Rn. Then for any l × nmatrix H,

(Σk − Σk) ∩Ker(H) = Σ2k ∩Ker(H) 6= 0. (5.32)

Hence there exist two k-sparse vectors that have the same image under H.On the other hand, l = 2k is sufficient to linearly compress all k-sparse vectors,

because (5.32) holds for Lebesgue-a.e. 2k × n matrix H. To see this, choose H to bea random matrix with i.i.d. entries according to some continuous distribution (e.g.,Gaussian). Then (5.32) holds if and only if all 2k × 2k submatrices formed by 2kcolumns of H are invertible. This is an almost sure event, because the determinantof each of the

(n2k

)submatrices is an absolutely continuous random variable. The

sufficiency of l = 2k is a bit stronger than the linear embedding result mentioned inthe discussion before Theorem 40, which gives l = 2k+1. For an explicit construction

of such a 2k×n matrix, we can choose H to be the cosine matrix Hij = cos(

(i−1)jπn

)(see Appendix C.1).

5.3.2 Auxiliary results

Let 0 < k < n. Denote by G(n, k) the Grassmannian manifold [35] consisting ofall k-dimensional subspaces of Rn. For V ∈ G(n, k), the orthogonal projection fromRn to V defines a linear mapping projV : Rn → Rn of rank k. The technique we use inthe achievability proof of linear analog compression is to use the random orthogonalprojection projV as the encoder, where V is distributed according to the invariantprobability measure on G(n, k), denoted by γn,k [35]. The relationship between γn,kand the Lebesgue measure on Rk×n is shown in the following lemma.

Lemma 14 ([35, Exercise 3.6]). Denote the rows of a k×n matrix H by H1, . . . , Hk,the row span of H by Im(HT), and the volume of the unit `2-ball in Rn by α(n). Set

BL = H ∈ Rk×n : ‖Hi‖2 ≤ L, i = 1, . . . , k. (5.33)

Then for S ⊂ G(n, k) measurable, i.e., a collection of k-dimensional subspaces of Rn,

γn,k(S) = α(n)−kLebH ∈ B1 : Im(HT) ∈ S. (5.34)

75

Page 87: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

The following result states that a random projection of a given vector is not toosmall with high probability. It plays a central role in estimating the probability of“bad” linear encoders.

Lemma 15 ([35, Lemma 3.11]). For any xn ∈ Rn\0,

γn,kV : ‖projV xn‖2 ≤ δ ≤ 2nδk

α(n) ‖xn‖k2(5.35)

To show the converse part of Theorem 32 we will invoke the Steinhaus Theoremas an auxiliary result:

Lemma 16 (Steinhaus [102]). For any measurable set C ⊂ Rn with positive Lebesguemeasure, there exists an open ball centered at 0 contained in C − C.

Lastly, we give a characterization of the fundamental limit of lossless linear en-coding as follows.

Lemma 17. R∗(ε) is the infimum of R > 0 such that for sufficiently large n, thereexists a Borel set Sn ⊂ Rn and a linear subspace Hn ⊂ Rn of dimension at leastd(1−R)ne, such that

(Sn − Sn) ∩Hn = 0. (5.36)

Proof. (Converse): Suppose R is ε-achievable with linear encoders, i.e. R ≥ R∗(ε).Since any linear function between two finite-dimensional linear spaces is uniquelyrepresented by a matrix, by Definition 13, for sufficiently large n, there exists amatrix H ∈ RbRnc×n and g : RbRnc → Rn such that

P g(HXn) = Xn ≥ 1− ε. (5.37)

Thus there exists Sn ⊂ Rn such that P Xn ∈ Sn ≥ 1− ε and H as a linear mappingrestricted to Sn is injective, which implies that for any xn 6= yn in Sn, Hxn 6= Hyn,that is, xn − yn /∈ Ker(H). Therefore,

(Sn − Sn) ∩Ker(H) = 0. (5.38)

Note that dim Ker(H) ≥ n− bRnc, hence (5.36) holds.(Achievability): Suppose that such a sequence of Sn and Hn exists then, by defi-

nition of R∗(ε), R ≥ R∗(ε).

5.3.3 Proofs

Proof of Theorem 39. We first show (5.29). Fix 0 < δ′ < δ arbitrarily. Let R =RB(ε) + δ, k = bRnc and k′ = b(RB(ε) + δ′)nc. We show that there exists a matrixHn ∈ Rn×n of rank k and gn : Im(Hn)→ Rn Borel measurable such that

P gn(HnXn) = Xn ≥ 1− ε (5.39)

76

Page 88: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

for sufficiently large n.By definition of RB(ε), there exists N0 ∈ N such that for all n ≥ N0, there exists

a compact Un ⊂ Rn such that P Xn ∈ Un ≥ 1 − ε and dimBUn ≤ k′. Given anencoding matrix H, define the decoder gH : Im(H)→ Rn as

gH(yk) = min H−1(yk) ∩ Un, (5.40)

where the min is taken componentwise2. Since H−1(yk) is closed and Un is compact,H−1(yk) ∩ Un is compact. Hence gH is well-defined.

Next consider a random orthogonal projection matrix Φ = projV independent ofXn, where V ∈ G(n, k) is a random k-dimensional subspace distributed according tothe invariant measure γn,k on G(n, k). We show that for all n ≥ N0,

P gΦ(ΦXn) 6= Xn ≤ ε, (5.41)

which implies that there exists at least one realization of Φ that satisfies (5.39). Tothat end, we define

pe(H) = P Xn ∈ Un, gH(HXn) 6= Xn (5.42)

and use the union bound:

P gΦ(ΦXn) 6= Xn ≤ P Xn /∈ Un+ E [pe(Φ)] , (5.43)

where the first term P Xn /∈ Un ≤ ε. Next we show that the second term is zero.Let U(xn) = Un − xn. Then

E [pe(Φ)] =

∫Un

P gΦ(Φxn) 6= xnµn(dxn) (5.44)

≤∫Un

P ∃yn ∈ U(xn) : Φyn = 0µn(dxn) (5.45)

=

∫Un

P Ker(Φ) ∩ U(xn) 6= 0µn(dxn), (5.46)

We show that for all xn ∈ Un, P Ker(Φ) ∩ U(xn) 6= 0 = 0. To this end, let

0 < β <δ − δ′

R=k − k′

k. (5.47)

Define

Tj(xn) = yn ∈ U(xn) : ‖yn‖2 ≥ 2−jβ, (5.48)

Qj = H : ∀yn ∈ Tj(xn), ‖Hyn‖2 ≥ 2−j. (5.49)

2Alternatively, we can use any other tie-breaking strategy as long as Borel measurability issatisfied.

77

Page 89: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Then ⋃j≥1

Tj(xn) = U(xn)\0. (5.50)

Observe that H ∈⋂j≥J Qj implies that

‖Hyn‖2 ≥ 2−(J+1) ‖yn‖1β

2 , ∀yn ∈ U(xn). (5.51)

Therefore H ∈ Qj for all but a finite number of j’s if and only if

‖Hyn‖2 ≥ C(H, xn) ‖yn‖1β

2 , ∀yn ∈ U(xn) (5.52)

for some C(H, xn) > 0.Next we show that Φ ∈ Qj for all but a finite number of j’s with probability

one. Cover U(xn) with 2−(j+1)-balls. The centers of those balls that intersect Tj(xn)

are denoted by wi,j, i = 1, . . . ,Mj. Pick yi,j ∈ Tj(xn) ∩ B(wi,j, 2

−(j+1)). ThenB(wi,j, 2

−(j+1)) ⊂ B(yi,j, 2−j), hence B(yi,j, 2

−j) : i = 1, . . . ,Mj cover Tj(xn). Sup-

pose ‖Hyi,j‖2 ≥ 2−(j−1), then for any z ∈ B(yi,j, 2−j),

‖Hz‖2 ≥‖Hyi,j‖2 − ‖H(yi,j − z)‖2 (5.53)

≥‖Hyi,j‖2 − ‖yi,j − z‖2 (5.54)

≥2−j, (5.55)

where (5.54) follows because H is an orthogonal projection. Thus ‖Hyi,j‖2 ≥ 2−(j−1)

for all i = 1, . . . ,Mj implies that H ∈ Qj. Therefore by the union bound,

∑j≥1

PΦ ∈ Qc

j

≤∑j≥1

Mj∑i=1

P‖Φyi,j‖2 ≤ 2−(j−1)

. (5.56)

By Lemma 15,

P‖Φyi,j‖2 ≤ 2−(j−1)

= γn,k(V : ‖projV yi,j‖2 ≤ 2−(j−1)) (5.57)

≤ 2n−(j−1)k

α(n) ‖yi,j‖k2(5.58)

≤ 2n+k(1−j+jβ)

α(n), (5.59)

where (5.59) is due to ‖yi,j‖2 ≥ 2−jβ, because yi,j ∈ Tj(xn). Since dimBUn < k′, thereis a constant C1 such that NUn(2−j−1) ≤ C12jk

′. Since U(xn) is a translation of Un,

it follows thatMj ≤ C12jk

′. (5.60)

78

Page 90: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Thus ∑j≥1

PΦ ∈ Qc

j

≤ C1α(n)−12n+k

∑j≥1

2j(k′−(1−β)k) (5.61)

< ∞, (5.62)

where

• (5.61): by substituting (5.59) and (5.60) into (5.56);• (5.62): by (5.47).

Therefore by the Borel-Cantelli lemma, Φ ∈ Qj for all but a finite number of j’s withprobability one. Hence

P‖Φyn‖2 ≥ C(Φ, xn) ‖yn‖

2 , ∀yn ∈ U(xn)

= 1, (5.63)

which implies that for any xn ∈ Un,

P Ker(Φ) ∩ U(xn) = 0 = 1. (5.64)

In view of (5.46),

E [pe(Φ)] = P Xn ∈ Un, gΦ(ΦXn) 6= Xn = 0, (5.65)

whence (5.41) follows. This shows the ε-achievability of R. By the arbitrariness ofδ > 0, (5.29) is proved.

Now we show thatP gA(AXn) 6= Xn ≤ ε (5.66)

holds for all A ∈ Rk×n except possibly on a set of zero Lebesgue measure, where gA

is the corresponding decoder for A defined in (5.40). Note that,

A : P gA(AXn) 6= Xn > ε ⊂ A : pe(A) > 0 (5.67)

=A : pe(projIm(AT)) > 0

, (5.68)

where

• (5.67): by (5.43).• (5.68): by (5.46) and Ker(A) = Ker(projIm(AT)).

DefineS , V ∈ G(n, k) : pe(projV ) > 0. (5.69)

Recalling BL defined in (5.33), we have

LebA ∈ B1 : pe(projIm(AT)) > 0

= LebA ∈ B1 : Im(AT) ∈ S (5.70)

= α(n)kγn,k(S) (5.71)

= 0, (5.72)

where

79

Page 91: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

• (5.70): by (5.68) and since Im(AT) ∈ G(n, k) holds Lebesgue-a.e.• (5.71): by Lemma 14.• (5.72): by (5.65).

Observe that (5.72) implies that for any L,

LebA ∈ BL : pe(projIm(AT)) > 0

= 0. (5.73)

Since Rn×k =⋃L≥1 BL, in view of (5.68) and (5.73), we conclude that (5.66) holds

Lebesgue-a.e.Finally we show that for any ε′ > ε, there exists a sequence of matrices and

β-Holder continuous decoders that achieves compression rate R and block error prob-ability ε′. Since Φ ∈ Qj for all but a finite number of j’s a.s., there exists a Jn(independent of xn), such that

P

Φ ∈

⋂j≥Jn

Qj

≥ 1− ε′ + ε. (5.74)

Thus by (5.51), for any xn ∈ Un,

P‖Φyn‖2 ≥ 2−(Jn+1) ‖yn‖

2 , ∀yn ∈ U(xn)

≥ 1− ε′ + ε. (5.75)

Integrating (5.75) with respect to µn(dxn) on Un and by Fubini’s theorem, we have

PXn ∈ Un, ‖Φ(yn −Xn)‖2 ≥ 2−Jn+1 ‖yn −Xn‖

2 , ∀yn ∈ Un

=

∫PXn ∈ Un, ‖H(yn −Xn)‖2 ≥ 2−Jn+1 ‖yn −Xn‖

2 , ∀yn ∈ UnPΦ(dH)

(5.76)

≥ 1− ε′. (5.77)

Hence there exists Sn ⊂ Un and an orthogonal projection matrix Hn of rank k, suchthat P Xn ∈ Sn ≥ 1− ε′ and for all xn, yn ∈ Sn,

‖Hn(yn − xn)‖2 ≥ 2−(Jn+1) ‖yn − xn‖1β

2 (5.78)

for all xn, yn ∈ Sn. Therefore3 H−1n |Hn(Sn) is (2β(Jn+1), β)-Holder continuous. By

the extension theorem of Holder continuous mappings [103], H−1n can be extended to

gn : Rk → Rn that is β-Holder continuous. Then

P gn(HnXn) 6= Xn ≤ ε′. (5.79)

Recall from (5.47) that 0 < β < R−RB(ε)−δ′R

. By the arbitrariness of δ′, (5.30) holds.

3 f |A denotes the restriction of f on the subset A.

80

Page 92: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Remark 18. Without recourse to the general result in Theorem 39, the achievabilityfor discrete-continuous sources in Theorem 32 can be proved directly as follows. In(5.43), choose

Un = xn ∈ Rn : |spt(xn)| ≤ (ρ+ δ/2)n , (5.80)

and consider Φ whose entries are i.i.d. standard Gaussian (or any other absolutelycontinuous distribution on R). Using linear algebra, it is straightforward to showthat the second term in (5.43) is zero. Thus the block error probability vanishes sinceP Xn /∈ Un → 0.

Proof of Corollary 2. It is sufficient to consider l = k+1. We show that for Lebesgue-a.e. l × n matrix H, µxn ∈ Σk : gH(Hxn) 6= xn = 0. Since gH(Hxn) = xn if andonly if gH(Hαxn) = αxn for any α > 0, it is sufficient to show

µxn ∈ Sk : gH(Hxn) 6= xn = 0, (5.81)

where Sk , Σk ∩Bn2 (0, 1).

Consider the l×n random projection matrix Φ defined in the proof of Theorem 39.Denote by PΦ the distribution of Φ and by µ×PΦ the product measure on Rn×Rl×n.Fix 0 < β < 1 − k

l. Observe that Σk is the union of

(nk

)k-dimensional subspaces of

Rn, and by Lemma 8, dimBSk = k. Following the same arguments leading to (5.77),we have

µ× PΦxn ∈ Sk,H : ‖H(yn − xn)‖2 ≥ C(H, xn) ‖yn − xn‖1β

2 , ∀yn ∈ Sk = µ(Sk).(5.82)

By Fubini’s theorem and Lemma 14, (5.81) holds for Lebesgue-a.e. H.When µ is finite, a β-Holder continuous decoder that achieves block error prob-

ability ε can be obtained using the same construction procedure as in the proof ofTheorem 39.

Finally we complete the proof of Theorem 32 by proving the converse.

Converse proof of Theorem 32. Let the distribution of X be defined as in (3.70). Weshow that for any ε < 1, R∗(ε) ≥ ρ. Since R∗(ε) ≥ 0, assume ρ > 0. Fix anarbitrary 0 < δ < ρ. Suppose R = ρ − δ is ε-achievable. Let k = b(ρ− δ)nc andk′ = b(ρ− δ/2)nc. By Lemma 17, for sufficiently large n, there exist a Borel set Sn

and a linear subspace Hn ⊂ Rn, such that P Xn ∈ Sn ≥ 1− ε, Sn−Sn ∩Hn = 0and dimHn ≥ n− k.

If ρ = 1, µ = µc is absolutely continuous with respect to Lebesgue measure.Therefore Sn has positive Lebesgue measure. By Lemma 16, Sn − Sn contains anopen ball in Rn. Hence Sn − Sn ∩Hn = 0 cannot hold for any subspace Hn withpositive dimension. This proves R∗(ε) ≥ 1. Next we assume that 0 < ρ < 1.

LetTn = xn ∈ Rn : |spt(xn)| ≥ k′ (5.83)

and Gn = Sn∩Tn. By (3.72), for sufficiently large n, P Xn /∈ Tn ≤ (1− ε)/2, henceP Xn ∈ Gn ≥ (1− ε)/2.

81

Page 93: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Next we decompose Gn according to the generalized support of xn:

Gn =⋃

U⊂1,...,n|U |≥k′

CU , (5.84)

where we have denoted the disjoint subsets

CU = xn ∈ Sn : spt(xn) = U . (5.85)

Then ∑U⊂1,...,n|U |≤k′

P Xn ∈ CU = P Xn ∈ Gn (5.86)

≥ (1− ε)/2 > 0. (5.87)

So there exists U ⊂ 1, . . . , n such that |U | ≤ k′ and P Xn ∈ CU > 0.Next we decompose each CU according to xnUc which can only take countably many

values. Let j = |U |. For y ∈ An−j, let

By = xn : xn ∈ CU , xnUc = y, (5.88)

Dy = xnU : xn ∈ By. (5.89)

Then CU can be written as a disjoint union of By:

CU =⋃

y∈An−jBy. (5.90)

Since P Xn ∈ CU > 0, there exists y ∈ An−j such that P Xn ∈ By > 0.Note that

P Xn ∈ By = µn(By) (5.91)

= ρj(1− ρ)n−jµjc(Dy)

n−j∏i=1

µd(yi) (5.92)

> 0. (5.93)

Therefore µjc(Dy) > 0. Since µjc is absolutely continuous with respect to Lebesguemeasure on Rj, Dy has positive Lebesgue measure. By Lemma 16, Dy −Dy containsan open ball B(0, r) for some r > 0. Therefore we have

K ⊂ By −By ⊂ Sn − Sn, (5.94)

whereK = xn ∈ Rn : xnUc = 0, xnU ∈ B(0, r). HenceK contains j linear independentvectors, denoted by a1, . . . , aj. Let b1, . . . , bm be a basis for Hn, where m ≥ n−kby assumption. Since j = |U | ≥ k′ > k, we conclude that a1, . . . , aj, b1, . . . , bm are

82

Page 94: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

linearly dependent. Therefore

m∑l=1

βlbl =

j∑l=1

αlal 6= 0 (5.95)

where αi 6= 0 and βl 6= 0 for some i and l. If we choose those nonzero coefficientssufficiently small, then

∑jl=1 αlal ∈ K and

∑ml=1 βlbl ∈ Hn since Hn is a linear

subspace. This contradicts Sn − Sn ∩Hn = 0. Thus R∗(ε) ≥ ρ− δ, and R∗(ε) ≥ ρfollows from the arbitrariness of δ.

5.4 Lossless Lipschitz Decompression

In this section we study the fundamental limit of lossless compression with Lip-schitz decoders. To facilitate the discussion, we first introduce several importantconcepts from geometric measure theory. Then we proceed to give proofs of Theo-rems 35 – 37.

5.4.1 Geometric measure theory

Geometric measure theory [104, 35] is an area of analysis studying the geometricproperties of sets (typically in Euclidean spaces) through measure theoretic methods.One of the core concepts in this theory is rectifiability, a notion of smoothness or reg-ularity of sets and measures. Basically a set is rectifiable if it is the image of a subsetof a Euclidean space under some Lipschitz function. Rectifiable sets admit a smoothanalog coding strategy. Therefore, lossless compression with Lipschitz decoders boilsdown to finding high-probability subsets of source realizations that are rectifiable. Incontrast, the goal of conventional almost-lossless data compression is to show concen-tration of probability on sets of small cardinality. This characterization enables us touse results from geometric measure theory to study Lipschitz coding schemes.

Definition 14 (Hausdorff measure and dimension). Let s > 0 and A ⊂ Rn. Define

Hsδ(A) = inf

∑i

diam(Ei)s : A ⊂

⋃i

Ei, diam(Ei) ≤ δ

, (5.96)

where diam(Ei) = sup‖x− y‖2 : x, y ∈ Ei. Define the s-dimensional Hausdorffmeasure on Rn by

Hs(A) = limδ↓0Hsδ(A), (5.97)

The Hausdorff dimension of A is defined by

dimH(A) = infs : Hs(A) <∞. (5.98)

83

Page 95: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Hausdorff measure generalizes both the counting measure and Lebesgue measureand provides a non-trivial way to measure low-dimensional sets in a high-dimensionalspace. When s = n, Hn is just a rescaled version of the usual n-dimensional Lebesguemeasure [35, 4.3]; when s = 0, H0 reduces to the counting measure. For 0 < s < n,Hs gives a non-trivial measure for sets of Hausdorff dimension s in Rn, because ifdimHA < s, Hs(A) = 0; if dimHA > s, Hs(A) = ∞. As an example, consider n = 1and s = log3 2. Let C be the middle-third Cantor set in the unit interval, which haszero Lebesgue measure. Then dimHC = s and Hs(C) = 1 [18, 2.3].

Definition 15 (Rectifiable sets, [104, 3.2.14]). E ⊂ Rn is called m-rectifiable if thereexists a Lipschitz mapping from some bounded set in Rm onto E.

Definition 16 (Rectifiable measures, [35, Definition 16.6] ). Let µ be a measure onRn. µ is called m-rectifiable if µ Hm and there exists a µ-a.s. set E ⊂ Rn that ism-rectifiable.

Several useful facts about rectifiability are presented as follows:

Lemma 18 ([104]). 1. An l-rectifiable set is also m-rectifiable for m ≥ l.

2. The Cartesian product of an m-rectifiable set and an l-rectifiable set is (m+ l)-rectifiable.

3. The finite union of m-rectifiable sets is m-rectifiable.

4. Countable sets are 0-rectifiable.

Using the notion of rectifiability, we give a sufficient condition for the ε-achievability of Lipschitz decompression:

Lemma 19. R(ε) ≤ R if there exists a sequence of bRnc-rectifiable sets Sn ⊂ Rn with

P Xn ∈ Sn ≥ 1− ε (5.99)

for all sufficiently large n.

Proof. See Appendix C.4.

Definition 17 (k-dimensional density, [35, Definition 6.8]). Let µ be a measure onRn. The k-dimensional upper and lower densities of µ at x are defined as

Dk(µ, x) = lim supr↓0

µ(B(x, r))

rk, (5.100)

Dk(µ, x) = lim infr↓0

µ(B(x, r))

rk, (5.101)

If Dk(µ, x) = Dk(µ, x), the common value is called the k-dimensional density of µ atx, denoted by Dk(µ, x).

84

Page 96: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

The following important result in geometric measure theory gives a density char-acterization of rectifiability for Borel measures.

Theorem 41 ([105, Theorem 5.6], Preiss Theorem). A σ-finite Borel measure on Rn

is m-rectifiable if and only if 0 < Dm(µ, x) <∞ for µ-a.e. x ∈ Rn.

Recalling the expression for information dimension d(PX) in (2.7), we see that forthe information dimension of a measure to be equal to m it requires that the exponentof the average measure of ε-balls equals m, whereas m-rectifiability of a measurerequires that the measure of almost every ε-ball scales as O (εm), a much strongercondition than the existence of information dimension. Obviously, if a probabilitymeasure µ is m-rectifiable, then d(µ) = m.

5.4.2 Converse

In view of the lossless Minkowski dimension compression results developed inSection 3.2, the general converse in Theorem 35 is rather straightforward to show.We prove the following non-asymptotic converse of Lipschitz decoding, which impliesTheorem 35 in particular:

Theorem 42. For any random vector Xn, if there exists a Borel measurable f :Rn → Rk and a Lipschitz continuous g : Rk → Rn such that P g(f(Xn)) 6= Xn ≤ ε,then

k ≥ dimε

B(PXn) (5.102)

≥ d(Xn)− εn. (5.103)

In view of (5.102) and the definition of RB(ε) in (3.14), we immediately have thefollowing general inequality

R(ε) ≥ RB(ε). (5.104)

If the source is memoryless and d(X) < ∞, then it follows from Theorem 14 thatR(ε) ≥ d(X). This completes the proof of Theorem 35.

Note that using the weaker result in (5.103) yields the following converse forgeneral input processes:

R(ε) ≥ lim supn→∞

d(Xn)

n− ε, (5.105)

which, for memoryless inputs, becomes the following weak converse:

R(ε) ≥ d(X)− ε. (5.106)

To get rid of ε in (5.106), it is necessary to use the strong converse in Theorem 14.

85

Page 97: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Proof of Theorem 42. To prove (5.102), denote C = f(xn) ∈ Rn : g(f(xn)) = xn ⊂Rk. Then

k ≥ dimB(C) (5.107)

≥ dimB(g(C)) (5.108)

≥ dimε

B(PXn), (5.109)

where

• (5.107): Minkowski dimension never exceeds the ambient dimension;• (5.108): Minkowski dimension never increases under Lipschitz mappings [35,

Exercise 7.6, p.108] (see Lemma 20 below for a proof);• (5.109): by P Xn ∈ g(C) ≥ 1− ε and (3.4).

It remains to show (5.103). By definition of dimε

B, for any δ > 0, there exists Esuch that PXn(E) ≥ 1− ε and dimB(E) ≥ dim

ε

B(PXn)− δ. Since PXn can be writtenas a convex combination of PXn|Xn∈E and PXn|Xn /∈E, applying Theorem 5 yields

d(Xn) ≤ d(PXn|Xn∈E)PXn(E) + d(PXn|Xn /∈E)(1− PXn(E)) (5.110)

≤ dimε

B(PXn)− δ + εn, (5.111)

where (5.111) holds because the information dimension of any distribution is upperbounded by the Minkowski dimension of its support [18]. Then the desired (5.103)follows from the arbitrariness of δ.

Lemma 20. Let T ⊂ Rk be bounded and f : T → Rn be a Lipschitz mapping. ThendimB(f(T )) ≤ dimB(T ).

Proof. Note that

dimBT = lim supδ↓0

logNT (δ)

log 1δ

. (5.112)

By definition of NT (δ), there exists xi : i = 1, . . . , NT (δ), such that T is covered bythe union of B(xi, δ), i = 1, . . . , NT (δ). Setting L = Lip(f), we have

f(T ) ⊂NT (δ)⋃i=1

f(B(xi, δ)) ⊂NT (δ)⋃i=1

B(f(xi), Lδ), (5.113)

which implies that Nf(T )(Lδ) ≤ NT (δ). Dividing both sides by log 1δ

and sending

δ → 0 yield dimB(f(T )) ≤ dimB(T ).

5.4.3 Achievability for finite mixture

We first prove a general achievability result for finite mixtures, a corollary of whichapplies to discrete-continuous mixed distributions in Theorem 36.

86

Page 98: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Theorem 43 (Achievability of Finite Mixtures). Let the distribution µ of X be amixture of finitely many Borel probability measures on R, i.e.,

µ =N∑i=1

ρiµi, (5.114)

where ρ1, . . . , ρN is a probability mass function. If Ri is εi-achievable with Lipschitzdecoders for µi, i = 1, . . . , N , then R is ε-achievable for µ with Lipschitz decoders,where

ε =N∑i=1

εi, (5.115)

R =N∑i=1

ρiRi. (5.116)

Proof. By induction, it is sufficient to show the result for N = 2. Denote ρ1 = ρ = 1−ρ2. Let Wi be a sequence of i.i.d. binary random variables with P Wi = 1 = ρ. LetXi be a i.i.d. sequence of real-valued random variables, such that the distribution ofeach Xi conditioned on the events Wi = 1 and Wi = 0 are µ1 and µ2 respectively.Then Xi : i ∈ N is a memoryless process with common distribution µ. Since theclaim of the theorem depends only on the probability law of the source, we base ourcalculation of block error probability on this specific construction.

Fix δ > 0. Since R1 and R2 are achievable for µ1 and µ2 respectively, by Lemma 19,there exists N1 such that for all n > N1, there exists Sn1 , S

n2 ⊂ Rn, with µn1 (Sn1 ) ≥

1− ε1, µn2 (Sn2 ) ≥ 1− ε2 and Sn1 is bR1nc-rectifiable and Sn2 is bR2nc-rectifiable. Let

Wn = wn ∈ 0, 1n : (ρ− δ)n ≤ ‖wn‖0 ≤ (ρ+ δ)n , (5.117)

By WLLN, P W n ∈ Wn → 1. Hence for any ε′ > 0, there exists N2, such that forall n > N2,

P W n ∈ Wn ≥ 1− ε′. (5.118)

Let

n > max

N1

ρ− δ,

N1

1− ρ− δ,N2

. (5.119)

Define

Sn = xn ∈ Rn : ∃m,T, (ρ− δ)n ≤ m ≤ (ρ+ δ)n,

T ⊂ 1, . . . , n, |T | = m,xT ∈ Sm1 , xT c ∈ Sn−m2 .(5.120)

87

Page 99: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Next we show that Sn is b(R + 2δ)nc-rectifiable. For all (ρ− δ)n ≤ m ≤ (ρ+ δ)n, itfollows from (5.119) that

N1 ≤ (ρ− δ)n ≤ m, (5.121)

N1 ≤ (1− ρ− δ)n ≤ n−m. (5.122)

and

R1m+R2(n−m) ≤ R1(ρ+ δ)n+R2(1− ρ+ δ)n (5.123)

= ((R1 +R2)δ +R)n (5.124)

≤ (2δ +R)n (5.125)

where R = ρR1 + (1 − ρ)R2 according to (5.116). Observe that Sn is a finite unionof subsets, each of which is a Cartesian product of a bR1mc-rectifiable set in Rm

and a bR2(n−m)c-rectifiable set in Rn−m. In view of Lemma 18 and (5.125), Sn isb(R + 2δ)nc-rectifiable,

Now we calculate the measure of Sn under µn:

P Xn ∈ Sn ≥ P Xn ∈ Sn,W n ∈ Wn (5.126)

=

b(ρ+δ)nc∑m=d(ρ−δ)ne

∑‖wn‖0=m

P W n = wn

× P Xn ∈ Sn | W n = wn (5.127)

≥b(ρ+δ)nc∑

m=d(ρ−δ)ne

∑|T |=m

∑supp(wn)=T

P W n = wn

× PXT ∈ Sm1 , XT c ∈ Sn−m2 | W n = wn

(5.128)

=

b(ρ+δ)nc∑m=d(ρ−δ)ne

∑‖wn‖0=m

P W n = wn

× µm1 (Sm1 )µn−m2 (Sn−m2 ) (5.129)

≥b(ρ+δ)nc∑

m=d(ρ−δ)ne

∑‖wn‖0=m

(1− ε1)(1− ε2)P W n = wn (5.130)

≥ (1− ε)P W n ∈ Wn (5.131)

≥ 1− ε− ε′, (5.132)

where

• (5.128): by construction of Sn in (5.120).• (5.130): by (5.121) and (5.122).• (5.131): ε = ε1 + ε2 according to (5.115).• (5.132): by (5.118).

88

Page 100: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

In view of Lemma 19, R + 2δ is (ε + ε′)-achievable for X. By the arbitrariness of δand ε′, the ε-achievability of R follows.

Proof of Theorem 36. Let the distribution of X be µ = (1− ρ)µd + ρµc as defined in(3.70), where µd is discrete and µc is absolutely continuous. By Lemma 18, countablesets are 0-rectifiable. For any 0 < ε < 1, there exists A > 0 such that µc([−A,A]) >1 − ε. By definition, [−A,A] is 1-rectifiable. Therefore by Lemma 19 and Theorem43, ρ is an ε-achievable rate for X. The converse follows from (5.104) and Theorem15.

5.4.4 Achievability for singular distributions

In this subsection we prove Theorem 37 for memoryless sources, using isomorphismresults in ergodic theory. The proof outline is as follows: a classical result in ergodictheory states that Bernoulli shifts are isomorphic if they have the same entropy.Moreover, the homomorphism can be chosen to be finitary, that is, each coordinateonly depends on finitely many coordinates. This finitary homomorphism naturallyinduces a Lipschitz decoder in our setup; however, the caveat is that the Lipschitzcontinuity is with respect to an ultrametric (Definition 18) that is not equivalent tothe usual Euclidean distance. Nonetheless, by an arbitrarily small increase in thecompression rate, the decoder can be modified to be Lipschitz with respect to theEuclidean distance. Before proceeding to the proof, we first present some necessaryresults of ultrametric spaces and finitary coding in ergodic theory.

Definition 18. Let (X, d) be a metric space. d is called an ultrametric if

d(x, z) ≤ maxd(x, y), d(y, z) (5.133)

for all x, y, z ∈ X.

A canonical class of ultrametric spaces is the ultrametric Cantor space [106]: letX = 0, . . . ,M − 1Z+ denote the set of all one-sided M -ary sequences x = (x0, . . .).To endow X with an ultrametric, define

dα(x, y) =

0 x = y,

α−minn∈N:xn 6=yn x 6= y.(5.134)

Then for every α > 1, dα is an ultrametric on X . In a similar fashion, we define anultrametric on [0, 1]k by considering the M -ary expansion of real vectors. Similar tothe binary expansion defined in (2.46), for xk ∈ [0, 1]k, i ∈ Z+, M ∈ N and M ≥ 2,define

(xk)M,i =⌊M ixk

⌋−M

⌊M i−1xk

⌋∈ 0, . . . ,M − 1k. (5.135)

thenxk =

∑i∈Z+

(xk)M,iM−i. (5.136)

89

Page 101: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Denoting for brevity(xk) = ((xk)M,0, (x

k)M,1, . . .), (5.137)

(5.134) induces an ultrametric on [0, 1]k:

d(xk, yk) = dM((xk), (yk)). (5.138)

It is important to note that d is not equivalent to the `∞ distance (or any `p distance),since we only have

d(xk, yk) ≥ 1

M

∥∥xk − yk∥∥∞ . (5.139)

To see the impossibility of the other direction of (5.139), consider x = 1/M andy =

∑li=2M

−i. As l → ∞, |x − y| → 0 but d(x, y) remains 1/M . Therefore, a

Lipschitz function with respect to d is not necessarily Lipschitz under ‖·‖∞. However,the following lemma bridges the gap if the dimension of the domain and the Lipschitzconstant are allowed to increase.

Lemma 21. Let d be the ultrametric on [0, 1]k defined in (5.138). W ⊂ [0, 1]k

and g : (W, d) → (Rn, ‖·‖∞) is Lipschitz. Then there exists W ′ ⊂ [0, 1]k+1 andg′ : (W ′, ‖·‖∞)→ (Rn, ‖·‖∞) such that g(W ) = g′(W ′) and g′ is Lipschitz.

Proof. See Appendix C.5.

Next we recall several results on finitary coding of Bernoulli shifts. Kolmogorov-Ornstein theory studies whether two processes with the same entropy rate are iso-morphic. In [107] Kean and Smorodinsky showed that two double-sided Bernoullishifts of the same entropy are finitarily isomorphic. For the single-sided case, DelJunco [108] showed that there is a finitary homomorphism between two single-sidedBernoulli shifts of the same entropy, which is a finitary improvement of Sinai’s the-orem [109]. We will see how a finitary homomorphism of the digits is related to areal-valued Lipschitz function, and how to apply Del Junco’s ergodic-theoretic resultto our problem.

Definition 19 (Finitary homomorphisms). Let C and D be finite sets. Let σ and τdenote the left shift operators on the product spaces X = CZ+ and Y = DZ+ respec-tively. Let µ and ν be measures on X and Y (with product σ-algebras). A homomor-phism φ : (X , µ, σ)→ (Y , ν, τ) is a measure preserving mapping that commutes withthe shift operator, i.e., ν = µ φ−1 and φ σ = τ φ µ-a.e. φ is said to be finitaryif there exist sets of zero measure A ⊂ X and B ⊂ Y such that φ : X\A → Y\B iscontinuous (with respect to the product topology).

Informally, finitariness means that for almost every x ∈ X , φ(x)0 is determinedby finitely many coordinates in x. The following lemma characterizes this intuitionin precise terms:

Lemma 22 ([110, Conditions 5.1, p. 281]). Let X = CZ+ and Y = DZ+. Letφ : (X , µ, σ) → (Y , ν, τ) be a homomorphism. Then the following statements areequivalent:

90

Page 102: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

1. φ is finitary.

2. For µ-a.e. x ∈ X , there exists j(x) ∈ N, such that for any x′ ∈ X , (x′)j(x)0 =

(x)j(x)0 implies that (φ(x))0 = (φ(x′))0.

3. For each j ∈ D, the inverse image φ−1y ∈ Y : y0 = j of each time-0 cylinderset in Y is, up to a set of measure 0, a countable union of cylinder sets in X .

Theorem 44 ([108, Theorem 1]). Let P and Q be probability distributions on finitesets C and D. Let (X , µ) = (CZ+ , P Z+) and (Y , ν) = (DZ+ , QZ+). If P and Q eachhave at least three non-zero components and H(P ) = H(Q), then there is a finitaryhomomorphism φ : (X , µ)→ (Y , ν).

We now use Lemmas 21 – 22 and Theorem 44 to prove Theorem 37.

Proof of Theorem 37. Without loss of generality, assume that the random variablesatisfies 0 ≤ X ≤ 1. Denote by P the common distribution of the M -ary digits of X.By Lemma 1,

d(X) =H(P )

logM. (5.140)

Fix n. Let d = d(X), k = ddne and C = 0, . . . ,M − 1n. Let Q be a probabilitymeasure on D = 0, . . . ,M − 1k such that H(P n) = H(Q). Such a Q alwaysexists because log |D| = k logM ≥ nH(P ). Let µ = (P n)Z+ and ν = QZ+ denotethe product measure on CZ+ and DZ+ respectively. Since µ and ν has the sameentropy rate, by Theorem 44, there exists a finitary homomorphism φ : (DZ+ , ν, σ)→(CZ+ , µ, τ). By the characterization of finitariness in Lemma 22, for any u ∈ DZ+ ,there exists j(u) ∈ N such that φ(u)0 is determined only by u0, . . . , uj(u). Denote theclosed ultrametric ball

Bu = v ∈ DZ+ : d(u, v) ≤M−(j(u)+1), (5.141)

where d is defined in (5.138). Then for any v ∈ Bu, φ(u)0 = φ(v)0. Note thatBu : u ∈ DZ+ forms a countable cover of DZ+ . This is because Bu is just a

cylinder set in DZ+ with base (u)j(u)0 , and the total number of cylinders is countable.

Furthermore, since intersecting ultrametric balls are contained in each other [111],there exists a sequence u(i) in DZ+ , such that

⋃i∈NBu(i) partitions DZ+ . Therefore,

for all 0 < ε < 1, there exists N , such that ν(E) ≥ 1− ε, where

E =N⋃i=1

Bu(i) . (5.142)

91

Page 103: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

For x ∈ [0, 1]k, recall the M -ary expansion of x defined in (5.137), denoted by (x) ∈DZ+ . Let

F = φ(E) ⊂ CZ+ (5.143)

W = x ∈ [0, 1]k : (x) ∈ E (5.144)

S = x ∈ [0, 1]n : (x) ∈ F. (5.145)

Since φ is measure-preserving, µ = ν φ−1, therefore

P Xn ∈ S = µ(F ) = ν(φ−1(F )) ≥ ν(E) ≥ 1− ε. (5.146)

Next we use φ to construct a real-valued Lipschitz mapping g. Define g : [0, 1]k →[0, 1]n by

g(x) =∑i∈Z+

φ((x))iM−i. (5.147)

Since φ commutes with the shift operator, for all z ∈ DZ+ , φ(z)i = (τ iφ(z))0 =φ(τ iz)0. Also, for x ∈ [0, 1]k, τ i(x) = (M ix). Therefore

g(x) =∑i∈Z+

φ((M ix))0M−i. (5.148)

Next we proceed to show that g : (W, d)→ ([0, 1]n, ‖·‖∞) is Lipschitz. In view of(5.139) and (5.134), it is sufficient to show that φ : (E, dM)→ (CZ+ , dM) is Lipschitz.Let

J = max1≤i≤N

j(u(i)). (5.149)

First observe that φ is MJ -Lipschitz on each ultrametric ball Bu(i) in E. To see this,consider distinct points v, w ∈ Bu(i) . Let dM(v, w) = M−m. Then m ≥ j(u(i)) + 1.Since (v)m−1

0 = (w)m−10 , φ(v) and φ(w) coincide on their first m − j(u(i)) − 1 digits.

Therefore

dM(φ(v), φ(w)) ≤M−m+j(u(i)) (5.150)

≤M j(u(i))dM(v, w) (5.151)

≤MJdM(v, w). (5.152)

Since every closed ultrametric ball is also open [111, Proposition 18.4], Bu(1) , . . . , Bu(N)

are disjoint, therefore φ is L-Lipschitz on E for some L > 0. Then for any y, z ∈ W ,

‖g(y)− g(z)‖∞ ≤Md(g(y), g(z)) (5.153)

=MdM(φ((y)), φ((z))) (5.154)

≤MLdM((y), (z)) (5.155)

=ML d(y, z). (5.156)

where

92

Page 104: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

• (5.153): by (5.139);• (5.154) and (5.156): by (5.138);• (5.155): by (5.152).

Hence g : (W, d)→ ([0, 1]n, ‖·‖∞) is Lipschitz.By Lemma 21, there exists a subset W ′ ⊂ [0, 1]k+1 and a Lipschitz mapping

g′ : (W ′, ‖·‖∞) → (Rn, ‖·‖∞) such that g′(W ′) = g(W ) = S. By Kirszbraun’stheorem [104, 2.10.43], we extend g′ to a Lipschitz function gn : [0, 1]k+1 → Rn. ThenS = gn(W ). Since gn is continuous and [0, 1]k+1 is compact, by Lemma 52, thereexists a Borel function fn : Rn → [0, 1]k+1, such that gn = f−1

n on S.To summarize, we have obtained a Borel function fn : Rn → [0, 1]k+1 and a Lips-

chitz function gn : [0, 1]k+1 → Rn, where k = ddne, such that P gn(fn(Xn)) = Xn ≥P Xn ∈ S ≥ 1− ε. Therefore we conclude that R(ε) ≤ d. The converse follows fromTheorem 35.

Lastly, we show thatR(0) = d = logM m (5.157)

for the special case when P is equiprobable on its support, where m = |supp(P )|.Recalling the construction of self-similar measures in Section 2.3, we first note that thedistribution of X is a self-similar measure that is generated by the IFS (S0, . . . , SM−1),where

Si(x) =x+ i

M, i = 0, . . . ,M − 1. (5.158)

This IFS satisfies the open set condition, since (0, 1) ⊃⋃M−1i=0 Si((0, 1)) and the union

is disjoint. Denote by E ⊂ [0, 1] the invariant set of the reduced IFS Si : i ∈supp(P ). By [25, Corollary 4.1], the distribution of X, denoted by µX , is in fact thenormalized d-dimensional Hausdorff measure on E, i.e., µX(·) = Hd(· ∩ E)/Hd(E).Therefore µXn(·) = Hdn(· ∩ En)/Hdn(En). By [18, Exercise 9.11], there exists aconstant c1 > 0, such that for all xn ∈ En,

Ddn(Hdn, xn) ≥ c1, (5.159)

that is, En has positive lower dn-density everywhere. By [112, Theorem 4.1(1)], forany k > dn, there exists F ⊂ En such that Hdn(F ) = 0 and En\F is k-rectifiable.Therefore P Xn ∈ F = 0. By Lemma 19, the rate d is 0-achievable.

Remark 19. The essential idea for constructing the Lipschitz mapping in [112, The-orem 4.1(1)] is the following: cover En and [0, 1]k with δ-balls respectively, denotedby B1, . . . , BM and B1, . . . , BN. Let g1 be the mapping such that each Bi ismapped to the center of some Bj. By (5.159), g1 can be shown to be Lipschitz. Thenfor each i and j cover Bi ∩ En and Bj by δ2-balls and repeat the same procedure toconstruct g2. In this fashion, a sequence of Lipschitz mappings gm with the sameLipschitz constant is obtained, which converges to a Lipschitz mapping whose imageis En.

93

Page 105: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

5.5 Lossless linear compression with Lipschitz de-

compressors

In this section we prove the second half of Theorem 32, which shows that fordiscrete-continuous mixture with finite entropy, it is possible to achieve linear com-pression and Lipschitz decompression with bounded Lipschitz constants simultane-ously. To this end, we need the following large-deviations result on Gaussian randommatrices.

Lemma 23. Let σmin(Bn) denote the smallest singular value of the kn ×mn matrixBn consisting of i.i.d. Gaussian entries with zero mean and variance 1

n. For any

t > 0, denoteFkn,mn(t) , P σmin(Bn) ≤ t. (5.160)

Suppose that mnkn

n→∞−−−→ α ∈ (0, 1) and knn

n→∞−−−→ R > 0. Then

lim infn→∞

1

nlog

1

Fkn,mn(t)≥ R(1− α) log

1

t√

eR− Rh(α)

2. (5.161)

Proof. For brevity let Hn =√nBn and suppress the dependence of kn and mn on

n. Then HTnHn is an l × l Gaussian Wishart matrix. The minimum eigenvalue of

HTnHn has a density, which admits the following upper bound [113, p. 553].

fλmin(HTnHn)(x) ≤ Ek,mx

k−m−12 e−

x2 , x ≥ 0, (5.162)

where

Ek,m ,√π2−

k−m+12

Γ(k+1

2

)Γ(m2

)Γ(k−m+1

2

)Γ(k−m+2

2

) . (5.163)

Then

P σmin(Bn) ≤ t = Pλmin(HT

nHn) ≤ nt2

(5.164)

≤ Ek,m

∫ nt2

0

xk−m−1

2 e−x2 dx (5.165)

≤ 2nk−m+1

2 Ek,mk −m+ 1

tk−m+1. (5.166)

Applying Stirling’s approximation to (5.166) yields (5.161).

The next lemma provides a probabilistic lower bound on the operator norm of theinverse of a Gaussian random matrix restricted on an affine subspace. The point ofthis result is that the lower bound depends only on the dimension of the subspace.

Lemma 24. Let A be a k × n random matrix with i.i.d. Gaussian entries with zeromean and variance 1

n. Let k > l. Then for any l-dimensional affine subspace U of

94

Page 106: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Rn,

P

minx∈U\0

‖Ax‖‖x‖

≤ t

≥ Fk,l+1(t). (5.167)

Proof. By definition, there exists v ∈ Rn and an m-dimensional linear subspace Vsuch that U = v + V . First assume that v /∈ V . Then 0 /∈ U . Let v0, . . . , vl be anorthonormal basis for V ′ = span(v, V ). Set Ψ = [v0, . . . , vl]. Then

minx∈U

‖Ax‖‖x‖

= minx∈V ′‖Ax‖‖x‖

(5.168)

= miny∈Rk‖AΨy‖‖y‖

(5.169)

= σmin(AΨ). (5.170)

Then (5.167) holds with equality since AΨ is an k × (l + 1) random matrix withi.i.d. normal entries of zero mean and variance 1

n. If v ∈ V , then (5.167) holds with

equality and l + 1 replaced by l. The proof is then complete because m 7→ Fk,m(t) isdecreasing.

Lemma 25. Let T be a union of N affine subspaces of Rn with dimension not ex-ceeding l. Let P Xn ∈ T ≥ 1 − ε. Let A be defined in Lemma 24 independent ofXn. Then

PXn ∈ T, inf

y∈T\X

‖A(y −X)‖‖y −X‖

≥ t

≥ 1− ε′, (5.171)

whereε′ = ε+NFk,l+1(t). (5.172)

Moreover, there exists a subset E ⊂ Rk×n with P A ∈ E ≥ 1−√ε′, such that for any

K ∈ E, there exists a Lipschitz continuous function gK : Rk → Rn with Lip(gK) ≤ t−1

andP gA(AXn) 6= Xn ≥ 1−

√ε′. (5.173)

Proof. By the independence of Xn and A,

PXn ∈ T, inf

y∈T\X

‖A(y −X)‖‖y −X‖

≥ t

=

∫T

PXn(dx)P

infz∈(T−x)\0

‖Az‖‖z‖

≥ t

(5.174)

≥ P Xn ∈ T(1−NFk,l+1) (5.175)

≥ 1− ε′. (5.176)

where (5.175) follows by applying Lemma 24 to each affine subspace in T − x andthe union bound. To prove (5.173), denote by p(K) the probability in the left-handside of (5.171) conditioned on the random matrix A being equal to K. By Fubini’s

95

Page 107: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

theorem and Markov inequality,

Pp(A) ≥ 1−

√ε′≥ 1−

√ε′. (5.177)

Put E = K : p(K) ≥ 1−√ε′. For each K ∈ E, define

UK =

x ∈ T : inf

y∈T\x

‖K(y − x)‖‖y − x‖

≥ t

⊂ T. (5.178)

Then for any x, y ∈ UK, we have

‖K(x− y)‖ ≥ t ‖x− y‖ , (5.179)

which implies that K|UK, the linear mapping K restricted on the set UK, is injective.

Moreover, its inverse gK : K(UK) → UK is t−1-Lipschitz. By Kirszbraun’s theorem[104, 2.10.43], gK can be extended to a Lipschitz function on the whole space Rk withthe same Lipschitz constant. For those K /∈ E, set gK ≡ 0. Since P Xn ∈ UK ≥1−√ε′ for all K ∈ E, we have

P gK(KXn) 6= Xn ≥ P Xn ∈ UA,A ∈ E ≥ 1−√ε′. (5.180)

This completes the proof of the lemma.

Proof of the second half of Theorem 32. It remains to establish the achievability partof (5.19): R(ε) ≤ γ. Fix δ, δ′ > 0 arbitrarily small. In view of Lemma 25, to prove theachievability of R = γ + δ+ δ′, it suffices to show that, with high probability, Xn liesin the union of exponentially many affine subspaces whose dimensions do not exceednR.

To this end, let t(xn) denote the discrete part of xn, i.e., the vector formed bythose xi ∈ A in increasing order of i. For k ≥ 1, define

Tk =

zk ∈ Ak :

1

k

k∑i=1

log1

Pd(zi)≤ H(Pd) + δ′

. (5.181)

Since H(Pd) <∞, we have |Tk| ≤ exp((H(Pd) + δ′)k). Moreover, P kd (Tk) ≥ 1− ε for

all sufficiently large k.Recall the generalized support spt(xn) of xn defined in (3.73). Let

Cn =xn ∈ Rn :

∣∣|spt(xn)| − γn∣∣ ≤ δn, t(xn) ∈ Tn−|spt(xn)|

(5.182)

=⋃

S⊂1,...,n||S|−γn|≤δn

⋃z∈Tn−|S|

xn ∈ Rn : spt(xn) = S, t(xn) = z . (5.183)

Note that each of the subsets in the right-hand side of (5.183) is an affine subspaceof dimension no more than (γ + δ)n. Therefore Cn consists of Nn affine subspaces,

96

Page 108: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

with Nn ≤∑|k−γn|≤δn |Tn−k|, hence

lim supn→∞

1

nlogNn ≤ (H(Pd) + δ′)(1− γ − δ)n. (5.184)

Moreover, for sufficiently large n, we have

P Xn ∈ Cn =∑

||S|−γn|≤δnP Xn ∈ Cn, spt(Xn) = S (5.185)

=∑

||S|−γn|≤δnP spt(Xn) = SP n−|S|

d (Tn−|S|) (5.186)

≥ P∣∣|spt(Xn)| − γn

∣∣ ≤ δn

(1− ε) (5.187)

≥ 1− 2ε. (5.188)

where (5.187) is due to (3.72). To apply Lemma 24, it remains to select a t sufficientlysmall but fixed, such that NnFk,m+1(t) = o(1) as n→∞. This is always possible, inview of (5.184) and Lemma 23, as long as

(1− α)R log1

t√

eR− Rh(α)

2> (H(Pd) + δ′)(1− γ − δ), (5.189)

where α = γ+δR

and R = γ + δ + δ′. By the arbitrariness of δ and δ′, the proof of

R(ε) ≤ γ is complete.

5.6 Lossless stable decompression

First we present a key result used in the converse proof of Theorem 38.

Lemma 26. Let S ⊂ Rn and T ⊂ [0, 1]k. Let g : T → S be bijective and 2−m-stablewith respect to the `∞ distance. Then for any δ > 0, there exists q : [S]m → [0, 1]l,where l = k + dδne such that

‖q(x)− q(y)‖∞ ≥ min2−m, 2−1δ (5.190)

holds for all distinct x and y.

Proof. See Appendix C.6.

The proof of Lemma 26 uses a graph coloring argument and the following resultfrom graph theory.

Lemma 27 (Greedy Graph Coloring, [114, Theorem 13.1.5]). Let G be a graph.Denote by ∆(G) the maximum degree of vertices in G and by χ(G) the coloring numberof G. Then the greedy algorithm produces a (∆(G) + 1)-coloring of the vertices of Gand hence

χ(G) ≤ ∆(G) + 1. (5.191)

97

Page 109: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Now we prove the converse coding theorem for stable analog compression.

Proof of Theorem 38 (Converse). Fix 0 ≤ ε < 1, δ > 0 and m ∈ N arbitrarily. LetR = R(ε, 2−m) + δ and k = bRnc. By definition, there exist fn : Rn → [0, 1]k andgn : [0, 1]k → Rn, such that gn is 2−m-stable and

P gn(fn(Xn)) 6= Xn ≤ ε. (5.192)

Let Sn = xn ∈ Rn : gn(fn(xn)) = xn, then

1− ε ≤ P Xn ∈ Sn (5.193)

≤ P [Xn]m ∈ [Sn]m. (5.194)

holds for any n. Since Xn is an i.i.d. random vector, [Xn]m is a random vectorconsisting of n i.i.d. copies of the source [X]m. By the strong converse of the losslesssource coding theorem for discrete memoryless sources, we have

lim infn→∞

|[Sn]m|n

≥ H([X]m). (5.195)

Next we upper bound the cardinality of [Sn]m through a volume argument. Sincegn : f(Sn) → Sn is bijective and 2−m-stable, by Lemma 26, there exists a mappingq : [Sn]m → [0, 1]l, where

l = k + dδne, (5.196)

such that‖q(x)− q(y)‖∞ ≥ min2−m, 2−

1δ , (5.197)

which implies that points in the image set q([Sn]m) are separated by at least 2−m

for all m > 1δ. Therefore B∞

(q(x), 2−(m+1)

)are disjoint for all x ∈ [Sn]m. Since

B∞ (z, r) is just an l-dimensional cube of side length 2r, we have

Leb(Bl∞(q(x), 2−(m+1)

))= 2−ml. (5.198)

Therefore

|[Sn]m| ≤Leb

([0, 1]k

)2−ml

= 2ml, (5.199)

hence

H([X]m)

m≤ 1

mlim infn→∞

log |[Sn]m|n

(5.200)

≤ 1

mlim supn→∞

log |[Sn]m|n

(5.201)

≤ R + δ (5.202)

= R(ε, 2−m) + 2δ, (5.203)

98

Page 110: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

holds for all m > 1δ, where (5.200) is due to (5.195). Recalling Lemma 1, we have

d(X) = lim supm→∞

H([X]m)

m(5.204)

≤ lim supm→∞

R(ε, 2−m) (5.205)

≤ lim sup∆→∞

R(ε,∆). (5.206)

where (5.205) is because of (5.203) and the arbitrariness of δ.

The achievability of stable compression is shown through an explicit construction.

Proof of Theorem 38 (Achievability). Fix an arbitrary ∆ > 0. There exists m ∈ N,such that

2−(m+1) ≤ ∆ < 2−m. (5.207)

Consider the memoryless source [Xi]m+1 : i ∈ N with entropy H([X]m+1). for allδ > 0, there exists Gn ⊂ 2−(m+1)Zn such that

|Gn| ≤ 2n[H([X]m+1)+δ] (5.208)

limn→∞

P [Xn]m+1 ∈ Gn = 1. (5.209)

DefineSn = xn ∈ Rn : [xn]m+1 ∈ G . (5.210)

Thenlimn→∞

P Xn ∈ Sn = 1. (5.211)

Note that Sn is a disjoint union of at most |Gn| hypercubes in Rn. Let k ∈ N. Weconstruct fn : Rn → [0, 1]k as follows. Uniformly divide [0, 1]k into 2km hypercubes.Choose 2k(m−1) cubes out of them as the codebook, such that any pair of cubes areseparated by 2−m in `∞ distance. Denote the codebook by Cn. In view of (5.208),letting

k

n=H([X]m+1) + δ

m− 1. (5.212)

is sufficient to ensure that the number of cubes in Cn is no less than |Gn|. Constructfn by mapping each hypercube in Cn to one in Sn by an arbitrary bijective mapping.Denote the inverse of fn by gn. Now we proceed to show that gn is ∆-stable. Considerany x, y ∈ Cn such that ‖x− y‖∞ ≤ ∆. Since ∆ < 2−m, x and y reside in thesame hypercube in Cn, so will g(x) and g(y) be in Sn. Therefore ‖g(x)− g(y)‖∞ ≤2−(m+1) ≤ ∆. In view of (5.212), the ∆-stability of gn implies that

R(ε,∆) ≤ H([X]m+1) + δ

m− 1(5.213)

99

Page 111: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

holds for all 0 < ε < 1, where m =⌊log2

1∆

⌋by (5.207). Sending ∆→ 0, we have

lim sup∆↓0

R(ε,∆) ≤ lim supm→∞

H([X]m+1) + δ

m− 1= d(X) (5.214)

for all 0 < ε < 1.

100

Page 112: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Chapter 6

Noisy compressed sensing

This chapter extends the analog compression framework in Chapter 5 to the casewhere the encoder output is contaminated by additive noise. Noise sensitivity, theratio between the reconstruction error and the noise variance, is used to gauge therobustness of the decoding procedure. Focusing on optimal decoders (the MMSEestimator), in Section 6.2 we formulate the fundamental tradeoff between measure-ment rate and reconstruction fidelity under three classes of encoders, namely optimalnonlinear, optimal linear and random linear encoders. In Section 6.5, optimal phasetransition thresholds of noise sensitivity are determined as a functional of the inputdistribution. In particular, we show that for discrete-continuous mixtures, Gaussiansensing matrices incur no penalty on the phase transition threshold with respect tooptimal nonlinear encoding. Our results also provide a rigorous justification of previ-ous results based on replica heuristics [7] in the weak-noise regime. The material inthis chapter has been presented in part in [115, 116].

6.1 Setup

The basic setup of noisy compressed sensing is a joint source-channel coding prob-lem as shown in Fig. 6.1, where we assume that

Encoderf : Rn→Rk +

σNk

Decoderg: Rk→Rn

Xn Y k Y k Xn

Figure 6.1: Noisy compressed sensing setup.

• The source is memoryless, where Xn consists of i.i.d. copies of a real-valuedrandom variable X with unit variance.

• The channel is memoryless with additive Gaussian noise σNk where Nk ∼N (0, Ik).

101

Page 113: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

• The encoder satisfies a unit average power constraint:

1

kE[‖f(Xn)‖2

2] ≤ 1. (6.1)

• The reconstruction error is gauged by the per-symbol MSE distortion:

d(xn, xn) =1

n‖xn − xn‖2

2. (6.2)

In this setup, the fundamental question is: For a given noise variance and mea-surement rate, what is the smallest possible reconstruction error? For a given encoderf , the corresponding optimal decoder g is the MMSE estimator of the input Xn giventhe channel output Y k = f(Xn) + σNk. Therefore the optimal distortion achievedby encoder f is

mmse(Xn|f(Xn) + σNk). (6.3)

6.2 Distortion-rate tradeoff

For a fixed noise variance, we define three distortion-rate functions that corre-sponds to optimal encoding, optimal linear encoding and random linear encodingrespectively.

6.2.1 Optimal encoder

Definition 20. The minimal distortion achieved by the optimal encoding scheme isgiven by:

D∗(X,R, σ2) , lim supn→∞

1

ninff

mmse(Xn|f(Xn) + σNk) : E[‖f(Xn)‖2

2] ≤ Rn.

(6.4)

The asymptotic optimization problem in (6.4) can be solved by applying Shannon’sjoint source-channel coding separation theorem for memoryless channels and sources[90], which states that the lowest rate that achieves distortion D is given by

R =RX(D)

C(σ2), (6.5)

where RX(·) is the rate-distortion function of X defined in (2.64) and C(σ2) =12

log(1 + σ−2) is the AWGN channel capacity. By the monotonicity of the rate-distortion function, we have

D∗(X,R, σ2) = R−1X

(R

2log(1 + σ−2)

). (6.6)

102

Page 114: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

In general, optimal encoders are nonlinear. In fact, Shannon’s separation theoremstates that the composition of an optimal lossy source encoder and an optimal channelencoder is asymptotically optimal when blocklength n → ∞. Such a constructionresults in an encoder that is finitely-valued, hence nonlinear. For fixed n and k, linearencoders are in general suboptimal.

6.2.2 Optimal linear encoder

To analyze the fundamental limit of noisy compressed sensing, we restrict theencoder f to be a linear mapping, denoted by a matrix H ∈ Rk×n. Since Xn are i.i.d.with zero mean and unit variance, the input power constraint (6.1) simplifies to

E[‖HXn‖22] = E[XnTHTHXn] = Tr(HTH) = ‖H‖2

F ≤ k, (6.7)

where ‖·‖F denotes the Frobenius norm.

Definition 21. Define the optimal distortion achievable by linear encoders as:

D∗L(X,R, σ2) , lim supn→∞

1

ninfH

mmse(Xn|HXn + σNk) : ‖H‖2

F ≤ Rn. (6.8)

A general formula for D∗L(X,R, σ2) is yet unknown. One example where we com-pute it explicitly in Section 6.4 is the Gaussian source.

6.2.3 Random linear encoder

We consider the ensemble performance of random linear encoders and relax thepower constraint in (6.7) to hold on average:

E[‖A‖2

F

]≤ k. (6.9)

In particular, we focus on the following ensemble of random sensing matrices, forwhich (6.9) holds with equality.:

Definition 22. Let An be a k × n random matrix with i.i.d. entries of zero meanand variance 1

n. The minimal expected distortion achieved by this ensemble of linear

encoders is given by:

DL(X,R, σ2) , lim supn→∞

1

nmmse(Xn|(AXn + σNk,A)) (6.10)

= lim supn→∞

mmse(X1|(AXn + σNk,A)) (6.11)

where (6.11) follows from symmetry and mmse(·|·) is defined in (4.2).

Based on the statistical-physics approach in [117, 118], the decoupling principleresults in [118] were imported into the compressed sensing setting in [7] to postulate

103

Page 115: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

the following formula for DL(X,R, σ2). Note that this result is based replica heuristicslacking a rigorous justification.

Claim 1 ([7, Corollary 1, p.5]).

DL(X,R, σ2) = mmse(X, ηRσ−2), (6.12)

where 0 < η < 1 satisfies the following equation [7, (12) – (13), pp. 4 – 5]:1

1

η= 1 +

1

σ2mmse(X, ηRσ−2). (6.13)

When (6.13) has more than one solution, η is chosen to be the one that minimizesthe free energy

I(X;√ηRσ−2X +N) +

R

2(η − 1− log η). (6.14)

Note that DL(X,R, σ2) defined in Definition 22 depends implicitly on the entry-wise distribution of the random measurement matrix A, while Claim 1 asserts thatthe the formula (6.12) for DL(X,R, σ2) holds universally, as long as its entries arei.i.d. with variance 1

n. Consequently, it is possible to employ random sparse measure-

ment matrix so that each encoding operation involves only a few signal components,for example,

Aij ∼p

2δ −1√

pn+ (1− p)δ0 +

p

2δ 1√

pn(6.15)

for some 0 < p < 1.

Remark 20. The solutions to (6.13) are precisely the stationary points of the freeenergy (6.14) as a function of η. In fact it is possible for (6.12) to have arbitrarilymany solutions. To see this, let β = ηRσ−2. Then (6.12) becomes

βmmse(X, β) = R− σ2β. (6.16)

Consider a Cantor distributed X, for which βmmse(X, β) oscillates logarithmically inβ between the lower and upper MMSE dimension of X (c.f. Fig. 4.1). In Fig. 6.2 weplot both sides of (6.16) against log3 β. We see that as σ2 decreases, β 7→ R − σ2βbecomes flatter and number of solutions blows up.

6.2.4 Properties

Theorem 45. 1. For fixed σ2, D∗(X,R, σ2) and D∗L(X,R, σ2) are both decreasing,convex and continuous in R on (0,∞).

2. For fixed R, D∗(X,R, σ2) and D∗L(X,R, σ2) are both increasing, convex andcontinuous in 1

σ2 on (0,∞).

1In the notation of [7, (12)], γ and εµ correspond to Rσ−2 and R in our formulation.

104

Page 116: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

2 4 6 8 10 12

0.62

0.63

0.64

R− σ2β

log3 β

βmmse(X,β

)

Figure 6.2: For sufficiently small σ2, there are arbitrarily many solutions to (6.16).

3.D∗(X,R, σ2) ≤ D∗L(X,R, σ2) ≤ DL(X,R, σ2) ≤ 1. (6.17)

Proof. 1. Fix σ2. Monotonicity with respect to the measurement rate R is straight-forward from the definition of D∗ and D∗L. Convexity follows from time-sharingbetween two encoding schemes. Finally, convexity on the real line implies con-tinuity.

2. Fix R. For any n and any encoder f : Rn → Rk, σ2 7→ mmse(Xn|f(Xn) +σNk) is increasing. This is a result of the infinite divisibility of the Gaussiandistribution as well as the data processing lemma of MMSE [87]. Consequently,σ2 7→ D∗(X,R, σ2) is also increasing. Since D∗ can be equivalently defined as

D∗(X,R, σ2) = lim supn→∞

1

ninff

mmse(Xn|f(Xn) + σNk) : E[‖f(Xn)‖2

2] ≤ k

σ2

,

(6.18)convexity in 1

σ2 follows from time-sharing. The results on D∗L follows analo-gously.

3. The leftmost inequality in (6.17) follows directly from the definition, while therightmost inequality follows because we can always discard the measurements

105

Page 117: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

and use the mean as an estimate. Although the best sensing matrix will beat theaverage behavior of any ensemble, the middle inequality in (6.17) is not quitetrivial because the power constraint in (6.1) is not imposed on each matrix inthe ensemble. We show that for any fixed ε > 0,

D∗L(X,R, σ2) ≤ DL(X,R, (1 + ε)2σ2). (6.19)

By the continuity of σ−2 7→ D∗L(X,R, σ2) proved in Theorem 45, σ2 7→D∗L(X,R, σ2) is also continuous. Therefore sending ε ↓ 0 in (6.19) yields thesecond inequality in (6.17). To show (6.19), recall that A consists of i.i.d.

entries with zero mean and variance 1n. Since k = nR, 1

k‖A‖2

F

P−→1 as n → ∞,by the weak law of large numbers. Therefore P A ∈ En → 1 where

En ,A : ‖A‖2

F ≤ k(1 + ε)2. (6.20)

Therefore

DL(X,R, (1 + ε)σ2)

= lim supn→∞

1

nmmse

(Xn|AXn + (1 + ε)2σ2Nk,A

)(6.21)

= lim supn→∞

1

nmmse

(Xn∣∣∣ A

1 + εXn +Nk,A

)(6.22)

≥ lim supn→∞

P A ∈ Enn

E[mmse

(Xn∣∣∣ A

1 + εXn +Nk,A

) ∣∣∣A ∈ En] (6.23)

= D∗L(X,R, σ2), (6.24)

where (6.23) holds because A1+ε

satisfies the power constraint for any A ∈ En.

Remark 21. Alternatively, the convexity properties of D∗ can be derived from (6.6).Since RX(·) is decreasing and concave, R−1

X (·) is decreasing and convex, which, com-posed with the concave mapping σ−2 7→ R

2log(1 + σ−2), gives a convex function

σ−2 7→ D∗(X,R, σ2) [119, p. 84]. The convexity of R 7→ D∗(X,R, σ2) can be simi-larly proved.

Remark 22. Note that the time-sharing proofs of Theorem 45.1 and 45.2 do notwork for DL, because time-sharing between two random linear encoders results in ablock-diagonal matrix with diagonal submatrices each filled with i.i.d. entries. Thisensemble is outside the scope of the random matrices with i.i.d. entries consider inDefinition 22. Therefore, proving convexity of R 7→ DL(X,R, σ2) amounts to showingthat the replacing all zeroes in the block-diagonal matrix with independent entries ofthe same distribution always helps with the estimation. This is certainly not true forindividual matrices.

106

Page 118: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

6.3 Phase transition of noise sensitivity

The main objective of noisy compressed sensing is the robust recovery of the inputvector, obtaining a reconstruction error that is proportional to the noise variance. Toquantify robustness, we analyze noise sensitivity : the ratio between the mean-squareerror and the noise variance, at a given R and σ2. As a succinct characterization ofrobustness, we focus particular attention on the worst-case noise sensitivity:

Definition 23. The worst-case noise sensitivity of optimal encoding is defined as

ζ∗(X,R) = supσ2>0

D∗(X,R, σ2)

σ2. (6.25)

For linear encoding, ζ∗L(X,R) and ζL(X,R) are analogously defined with D∗ in (6.25)replaced by D∗L and DL.

The phase transition threshold of the noise sensitivity is defined as the minimalmeasurement rate R such that the noise sensitivity is bounded for all noise variance[6, 115]:

Definition 24. Define

R∗(X) , inf R > 0 : ζ∗(X,R) <∞ . (6.26)

For linear encoding, R∗L(X) and RL(X) are analogously defined with ζ∗ in (6.26)replaced by ζ∗L and ζL.

By (6.17), the phase transition thresholds in Definition 24 are ordered naturallyas

0 ≤ R∗(X) ≤ R∗L(X) ≤ RL(X) ≤ 1, (6.27)

where the rightmost inequality is shown below (after Theorem 46).

Remark 23. In view of the convexity properties in Theorem 45.2, the three worst-case sensitivities in Definition 23 are all (extended real-valued) convex functions ofR.

Remark 24. Alternatively, we can consider the asymptotic noise sensitivity by re-placing the supremum in (6.25) with the limit as σ2 → 0, denoted by ξ∗, ξ∗L andξL respectively. Asymptotic noise sensitivity characterizes the convergence rate ofthe reconstruction error as the noise variance vanishes. Since D∗(X,R, σ2) is alwaysbounded above by varX = 1, we have

ζ∗(X,R) <∞⇔ ξ∗(X,R) <∞. (6.28)

Therefore R∗(X) can be equivalently defined as the infimum of all rates R > 0, suchthat

D∗(X,R, σ2) = O(σ2), σ2 → 0. (6.29)

107

Page 119: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

This equivalence also applies to D∗L and DL. It should be noted that although finiteworst-case noise sensitivity is equivalent to finite asymptotic noise sensitivity, thesupremum in (6.28) need not be achieved as σ2 → 0. An example is given by theGaussian input analyzed in Section 6.4.

6.4 Least-favorable input: Gaussian distribution

In this section we compute the distortion-rate tradeoffs for the Gaussian input dis-tribution, which simultaneously maximizes all three distortion-rate functions subjectto the variance constraint.

Theorem 46. Let XG ∼ N (0, 1). Then for any R, σ2 and X of unit variance,

D∗(X,R, σ2) ≤ D∗(XG, R, σ2) =

1

(1 + σ−2)R. (6.30)

D∗L(X,R, σ2) ≤ D∗L(XG, R, σ2) =

1− R

1 + σ2R ≤ 1,

1

1 +Rσ−2R > 1,

(6.31)

DL(X,R, σ2) ≤ DL(XG, R, σ2) =

1

2

(1−R− σ2 +

√(1−R)2 + 2(1 +R)σ2 + σ4

)(6.32)

Proof. Since the Gaussian distribution maximizes the rate-distortion function point-wise under the variance constraint [120, Theorem 4.3.3], the inequality in (6.30) fol-lows from (6.6). For linear encoding, linear estimators are optimal for Gaussian inputssince the channel output and the input are jointly Gaussian, but suboptimal for non-Gaussian inputs. Moreover, the linear MMSE depends only on the input variance.Therefore the inequalities in (6.31) and (6.32) follow. The distortion-rate functionsof XG are computed in Appendix D.

Remark 25. While Theorem 46 asserts that Gaussian input is the least favorableamong distributions with a given variance, it is not clear what the least favorablesparse input distribution is, i.e., the maximizer of D∗ (D∗L or DL) in the set PX :P X = 0 ≥ γ.

The Gaussian distortion-rate tradeoffs in (6.30) – (6.32) are plotted in Figs. 6.3and 6.4. We see that linear encoders are optimal for lossy encoding of Gaussiansources in Gaussian channels if and only if R = 1, i.e.,

D∗(XG, 1, σ2) = D∗L(XG, 1, σ

2), (6.33)

which is a well-known fact [121, 122]. As a result of (6.32), the rightmost inequalityin (6.27) follows.

108

Page 120: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

0 1 2 3 4 5snr0.0

0.2

0.4

0.6

0.8

1.0

Distortion

R = 0.3

R = 1

R = 4

Figure 6.3: D∗(XG, R, σ2), D∗L(XG, R, σ

2), DL(XG, R, σ2) against snr = σ−2.

0 2 4 6 8 10R0.0

0.2

0.4

0.6

0.8

1.0

Distortion

Linear coding is optimal iff R = 1

Random linear

Optimal linear

Optimal

Figure 6.4: D∗(XG, R, σ2), D∗L(XG, R, σ

2), DL(XG, R, σ2) against R when σ2 = 1.

109

Page 121: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Next we analyze the high-SNR asymptotics of (6.30) – (6.32). Clearly,D∗(XG, R, σ

2) is the smallest among the three, which vanishes polynomially inσ2 according to

D∗(XG, R, σ2) = σ2R +O(σ2R+2), σ2 → 0 (6.34)

regardless of how small R > 0 is. For linear encoding, we have

D∗L(X,R, σ2) =

1−R +Rσ2 +O (σ4) 0 ≤ R < 1,

σ2 +O(σ4) R = 1,σ2

R+O(σ4) R > 1.

(6.35)

DL(XG, R, σ2) =

1−R + R

1−Rσ2 +O(σ4) 0 ≤ R < 1,

σ − σ2

2+O(σ3) R = 1,

σ2

R−1+O(σ4) R > 1.

(6.36)

The weak-noise behavior of D∗L and DL are compared in different regimes of measure-ment rates:

• 0 ≤ R < 1: both D∗L and DL converge to 1−R > 0. This is an intuitive result,because even in the absence of noise, the orthogonal projection of the inputvector onto the nullspace of the sensing matrix cannot be recovered, whichcontributes a total mean-square error of (1 − R)n; Moreover, DL has strictlyworse second-order asymptotics than D∗L, especially when R is close to 1.

• R = 1: DL = σ(1+o(1)) is much worse thanD∗L = σ2(1+o(1)), which is achievedby choosing the encoding matrix to be identity. In fact, with nonnegligibleprobability, the optimal estimator that attains (6.32) blows up the noise powerwhen inverting the random matrix;

• R > 1: both D∗L and DL behave according to Θ(σ2), but the scaling constantof D∗L is strictly worse, especially when R is close to 1.

The foregoing high-SNR analysis shows that the average performance of randomsensing matrices with i.i.d. entries is much worse than that of optimal sensing matri-ces, except if R 1 or R 1. Although this conclusion stems from the high-SNRasymptotics, we test it with several numerical results. Fig. 6.3 (R = 0.3 and 5) andFig. 6.4 (σ2 = 1) illustrate that the superiority of optimal sensing matrices carriesover to the regime of non-vanishing σ2. However, as we will see, randomly selectedmatrices are as good as the optimal matrices (and in fact, optimal nonlinear encoders)as far as the phase transition threshold of the noise sensitivity is concerned.

From (6.34) and (6.36), we observe that both D∗L and DL exhibit a sharp phasetransition near the critical rate R = 1:

limσ2→0

D∗L(X,R, σ2) = limσ2→0

DL(X,R, σ2) (6.37)

= (1−R)+. (6.38)

110

Page 122: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

where x+ , max0, x. Moreover, from (6.30) – (6.32) we obtain the worst-case andasymptotic noise sensitivity functions for the Gaussian input as follows:

ζ∗(XG, R) =

(R−1)R−1

RRR ≥ 1

∞ R < 1, ξ∗(XG, R) =

0 R > 1

1 R = 1

∞ R < 1

(6.39)

ζ∗L(XG, R) = ξ∗L(XG, R) =

1R

R ≥ 1

∞ R < 1(6.40)

and

ζL(XG, R) = ξL(XG, R) =

1

R−1R > 1

∞ R ≤ 1(6.41)

The worst-case noise sensitivity functions are plotted in Fig. 6.5 against the measure-ment rate R. Note that (6.39) provides an example for Remark 24: for Gaussianinput and R > 1, the asymptotic noise sensitivity for optimal coding is zero, whilethe worst-case noise sensitivity is always strictly positive.

0 1 2 3 4R0.0

0.5

1.0

1.5

2.0

2.5

3.0

Noise sensitivity

ζ∗L = 1R

ζL = 1R−1

ζ∗ = (R−1)R−1

RR

+∞

unstable stable

Figure 6.5: Worst-case noise sensitivity ζ∗, ζ∗L and ζL for the Gaussian input, whichall become infinity when R < 1 (the unstable regime).

In view of (6.39) – (6.41), the phase-transition thresholds in the Gaussian signalcase are:

R∗(XG) = R∗L(XG) = RL(XG) = 1. (6.42)

The equality of the three phase-transition thresholds is in fact very general. In thenext section, we formulate and prove the existence of the phase transition thresholds

111

Page 123: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

of all three distortion-rate functions for discrete-continuous mixtures, which turn outto be equal to the information dimension of the input distribution.

6.5 Non-Gaussian inputs

This section contains our main results, which show that the phase transitionthresholds are equal to the information dimension of the input, under rather gen-eral conditions. Therefore, the optimality of random sensing matrices in terms of theworst-case sensitivity observed in Section 6.4 carries over well beyond the Gaussiancase. Proofs are deferred to Section 6.6.

The phase transition threshold for optimal encoding is given by the upper infor-mation dimension of the input:

Theorem 47. For any X that satisfies (2.10),

R∗(X) = d(X). (6.43)

Moreover, if PX is a discrete-continuous mixture as in (1.4), then as σ → 0,

D∗(X,R) =exp

(2H(Pd)1−γ

γ− 2D(Pc)

)(1− γ)2(1−γ)γγ

σ2Rγ (1 + o(1)) (6.44)

where D(·) denotes the non-Gaussianness of a probability measure defined in (2.68).Consequently the asymptotic noise sensitivity of optimal encoding is

ξ∗(X,R) =

∞ R < γexp(2H(Pd) 1−γ

γ−2D(Pc))

(1−γ)2(1−γ)γγR = γ

0 R > γ.

(6.45)

Assuming the validity of the replica calculations of DL in Claim 1, it can be shownthat the phase transition threshold for random linear encoding is always sandwichedbetween the lower and the upper MMSE dimension of the input. The relationshipbetween the MMSE dimension and the information dimension in Theorem 20 playsa key role in analyzing the minimizer of the free energy (6.14).2

Theorem 48. Assume that Claim 1 holds for X. Then for any i.i.d. random mea-surement matrix A whose entries have zero mean and variance 1

n,

D(X) ≤ RL(X) ≤ D(X). (6.46)

2It can be shown that in the limit of σ2 → 0, the minimizer of (6.14) when R > D(X) andR < D(X) corresponds to the largest and the smallest root of the fixed-point equation (6.12)respectively.

112

Page 124: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Therefore if D(X) exists, we have

RL(X) = D(X) = d(X), (6.47)

and in addition,

DL(X,R, σ2) =d(X)

R− d(X)σ2(1 + o(1)). (6.48)

Remark 26. In fact, the proof of Theorem 48 shows that the converse part (leftinequality) of (6.46) holds as long as there is no residual error in the weak-noise limit,that is, if DL(X,R, σ2) = o(1) as σ2 → 0, then R ≥ D(X) must hold. Therefore theconverse in Theorem 48 holds even if we weaken the right-hand side in (6.29) fromO(σ2) to o(1).

The general result in Theorem 48 holds for any input distribution but relies on theconjectured validity of Claim 1. For the special case of discrete-continuous mixtures asin (1.4), in view of Theorem 27, (6.47) predicts that the phase transition threshold forrandom linear encoding is γ. As the next result shows, this conclusion can indeed berigorously established for random matrix encoders with i.i.d. Gaussian entries. Since,in view of Theorem 47, nonlinear encoding achieves the same threshold, (random)linear encoding suffices for robust reconstruction as long as the input distributioncontains no singular component.

Theorem 49. Assume that X has a discrete-continuous mixed distribution as in(1.4), where the discrete component P d has finite entropy. Then

R∗(X) = R∗L(X) = RL(X) = γ. (6.49)

Moreover, (6.49) holds for any non-Gaussian noise distribution with finite non-Gaussianness.

Remark 27. The achievability proof of RL(X) is a direct application of Theorem32, where the Lipschitz decompressor in the noiseless case is used as a suboptimalestimator in the noisy case. The outline of the argument is as follows: suppose that wehave obtained a sequence of linear encoders and L(R)-Lipschitz decoders (An, gn)with rate R and error probability εn → 0 as n→∞. Then

E[∥∥gn(AnX

n + σNk)−Xn∥∥2]≤ L(R)2σ2E

[∥∥Nk∥∥2]

+ εn = kL(R)2σ2varN + εn,

(6.50)which implies robust construction is achievable at rate R and the worst-case noisesensitivity is upper bounded by L(R)2R.

Notice that the above Lipschitz-based noisy achievability approach applies to anynoise with finite variance, without requiring that the noise be additive, memoryless orthat it have a density. In contrast, replica-based results rely crucially on the fact thatthe additive noise is memoryless Gaussian. Of course, in order for the converse (via

113

Page 125: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

R∗) to hold, the non-Gaussian noise needs to have finite non-Gaussianness. The dis-advantage of this approach is that currently it lacks an explicit construction becausethe extendability of Lipschitz functions (Kirszbraun’s theorem) is only an existenceresult which relies on Hausdorff maximal principle [100, Theorem 1.31, p. 21].

In view of Remark 13, it it interesting to examine whether the Lipschitz noiselessdecoder allows us to achieve the optimal asymptotic noise sensitivity 1

R−γ predicted

by the replica method in (6.48). If the discrete component of the input distributionhas non-zero entropy (e.g., for simple signals (1.3)), the Lipschitz constant in (5.21)depends exponentially on 1

R−γ . However, for sparse input (1.2), the Lipschitz constant

in (5.21) behaves as 1√R−δ , which, upon squaring, attains the optimal noise sensitivity.

Remark 28. We emphasize the following “universality” aspects of Theorem 49:

• Gaussian random sensing matrices achieve the optimal transition threshold forany discrete-continuous mixture;

• The fundamental limit depends on the input statistics only through the weighton the analog component, regardless of the specific discrete and continuouscomponents. In the conventional sparsity model (1.2) where PX is the mix-ture of an absolutely continuous distribution and a mass of 1 − γ at 0, thefundamental limit is γ;

• The suboptimal estimator used in the achievability proof comes from the noise-less Lipschitz decoder, which does not depend on the noise distribution, letalone its variance;

• The conclusion holds for non-Gaussian noise as long as it has finite non-Gaussianness.

Remark 29. Assume the validity of Claim 1. Combining Theorem 47, Theorem 48and (6.27) gives an operational proof for the inequality d(X) ≤ D(X), which hasbeen proven analytically in Theorem 20.

6.6 Proofs

Proof of Theorem 47. The proof of (6.26) is based on the low-distortion asymptoticsof RX(D). Particularizing Theorem 8 to the MSE distortion (r = 2), we have

lim supD↓0

RX(D)12

log 1D

= d(X), (6.51)

Converse: Fix R > R∗(X). By definition, there exits a > 0 such thatD∗(X,R, σ2) ≤ aσ2 for all σ2 > 0. By (6.6),

R12

log(1 + σ−2)≥ RX(aσ2). (6.52)

Dividing both sides by 12

log 1aσ2 and taking lim supσ2→0 yield R > d(X) in view of

(6.51). By the arbitrariness of R, we have R∗(X) > d(X).

114

Page 126: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Achievability : Fix δ > 0 arbitrarily and let R = d(X) + 2δ. We show thatR ≤ R∗(X), i.e., worst-case noise sensitivity is finite. By Remark 24, this is equivalentto achieving (6.29). By (6.51), there exists D0 > 0 such that for all D < D0,

RX(D) ≤ d(X) + δ

2log

1

D. (6.53)

By Theorem 45, D∗(X,R, σ2) ↓ 0 as σ2 ↓ 0. Therefore there exists σ20 > 0, such that

D∗(X,R, σ2) < D0 for all σ2 < σ20. In view of (6.6) and (6.53), we have

d+ 2δ

2log

1

σ2= RX(D∗(X,R, σ2)) ≤ d(X) + δ

2log

1

D∗(X,R, σ2), (6.54)

i.e.,

D∗(X,R, σ2) ≤ σ2d(X)+2δ

d(X)+δ (6.55)

holds for all σ2 < σ20. This obviously implies the desired (6.29).

We finish the proof by proving (6.44) and (6.45). The low-distortion asymptoticexpansion of rate-distortion functions of discrete-continuous mixtures is found in [42,Corollary 1], which improves (2.64):3

RX(D) =γ

2log

γ

D+ h(γ) + (1− γ)H(Pd) + γh(Pc) + o(1), (6.56)

where PX is given by (1.4). Plugging into (6.6) gives (6.44), which implies (6.45) asa direct consequence.

Proof of Theorem 48. Achievability : We show that RL(X) ≤ D(X). Fix δ > 0arbitrarily and let R = D(X) + 2δ. Set γ = Rσ−2 and β = ηγ. Define

u(β) = βmmse(X, β)−R(

1− β

γ

)(6.57)

f(β) = I(X;√βX +NG)− R

2log β (6.58)

g(β) = f(β) +Rβ

2γ, (6.59)

which satisfy the following properties:

1. Since mmse(X, ·) is smooth on (0,∞) [66, Proposition 7], u, f and g are allsmooth functions on (0,∞). In particular, by the I-MMSE relationship [65],

f(β) =βmmse(X, β)−R

2β. (6.60)

3In fact h(γ) + (1− γ)H(Pd) + γh(Pc) is the γ-dimensional entropy of (1.4) defined by Renyi [24,Equation (4) and Theorem 3].

115

Page 127: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

2. For all 0 ≤ β ≤ γ,

f(β) ≤ g(β) ≤ f(β) +R

2. (6.61)

3. Recalling the scaling law of mutual information in (2.66), we have

lim supβ→∞

f(β)

log β=d(X)−D(X)− δ

2≤ −δ

2, (6.62)

where the last inequality follows from the sandwich bound between informationdimension and MMSE dimension in (4.46).

Note that the solutions to (6.13) are precisely the roots of u, at least one ofwhich belongs to [0, γ] because u(0) = −R < 0, u(γ) = γmmse(X, γ) > 0 and u iscontinuous. According to Claim 1,

DL(X,R, σ2) = mmse(X, βγ), (6.63)

where βγ is the root of u in [0, γ] that maximizes g(β).Proving the achievability of R amounts to showing that

lim supσ→0

DL(X,R, σ2)

σ2<∞, (6.64)

which, in view of (6.63), is equivalent to

lim infγ→∞

βγγ> 0. (6.65)

By the definition of D(X) and (6.62), there exists B > 0 such that for all β > B,

βmmse(X, β) < R− δ (6.66)

and

f(β) ≤ −δ4

log β. (6.67)

In the sequel we focus on sufficiently large γ. Specifically, we assume that

γ >R

δmax

B, e−

4δ (K−

R2 ), (6.68)

where K , minβ∈[0,B] g(β) is finite by the continuity of g.Let

β0 =δγ

R. (6.69)

Then β0 > B by (6.68). By (6.66), u(β0) = β0 mmse(X, β0) − R + δ < 0. Sinceu(γ) > 0, by the continuity of u and the intermediate value theorem, there exists

116

Page 128: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

β0 ≤ β∗ ≤ γ, such that u(β∗) = 0. By (6.66),

f(β) ≤ − δ

2β< 0, ∀β > B. (6.70)

Hence f strictly decreases on (B,∞). Denote the root of u that minimizes f(β) byβ′γ, which must lie beyond β∗. Consequently, we have

B <δγ

R= β0 ≤ β∗ ≤ β′γ. (6.71)

Next argue that βγ cannot differ from β′γ by a constant factor. In particular, we showthat

βγ ≥ e−Rδ β′γ, (6.72)

which, combined with (6.71), implies that

βγγ≥ δ

Re−

Rδ (6.73)

for all γ that satisfy (6.68). This yields the desired (6.65). We now complete theproof by showing (6.72). First, we argue that that βγ > B. This is because

g(βγ) ≤ g(β′γ) (6.74)

= f(β′γ) +Rβ′γ2γ

(6.75)

≤ f(β0) +R

2(6.76)

≤ − δ

4log

δγ

R+R

2(6.77)

< K (6.78)

= minβ∈[0,B]

g(β). (6.79)

where

• (6.74): by definition, βγ and β′γ are both roots of u and β′γ minimizes g amongall roots;

• (6.76): by (6.71) and the fact that f is strictly decreasing on (B,∞);• (6.77): by (6.67);• (6.79): by (6.68).

117

Page 129: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Now we prove (6.72) by contradiction. Suppose βγ < e−Rδ β′γ. Then

g(β′γ)− g(βγ) =R

2γ(β′γ − βγ) + f(β′γ)− f(βγ) (6.80)

≤ R

2+

∫ β′γ

βγ

f(b)db (6.81)

≤ R

2− δ

2log

β′γβγ

(6.82)

< 0, (6.83)

contradicting (6.74), where (6.82) is due to (6.70).Converse: We show that RL(X) ≥ D(X). Recall that RL(X) is the minimum

rate that guarantees that the reconstruction error DL(X,R, σ2) vanishes according toO(σ2) as σ2 → 0. In fact, we will show a stronger result: as long as DL(X,R, σ2) =o(1) as σ2 → 0, we have R ≥ D(X). By (6.63), DL(X,R, σ2) = o(1) if and only ifβγ →∞. Since u(βγ) = 0, we have

R ≥ lim supγ→∞

R

(1− βγ

γ

)(6.84)

= lim supγ→∞

βγ mmse(X, βγ) (6.85)

≥ lim infβ→∞

βmmse(X, β) (6.86)

= D(X). (6.87)

Asymptotic noise sensitivity: Finally, we prove (6.48). Assume that D(X) exists,i.e., D(X) = d(X), in view of (4.46). By definition of D(X), we have

mmse(X, β) =D(X)

β+ o

(1

β

), β →∞. (6.88)

As we saw in the achievability proof, wheneverR > D(X), (6.65) holds, i.e., ηγ = Ω(1)as γ →∞. Therefore, as γ →∞, we have

1

ηγ= 1 +

γ

Rmmse(X, ηγγ) = 1 +

D(X)

ηγR+ o(1), (6.89)

i.e.,

ηγ = 1− D(X)

R+ o(1). (6.90)

118

Page 130: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

By Claim 1,

DL(X,R, σ2) = mmse(X, ηγγ) (6.91)

=1− ηγηγ

σ2 (6.92)

=D(X)

R−D(X)σ2(1 + o(1)). (6.93)

Remark 30. Note that βγ is a subsequence parametrized by γ, which may takeonly a restricted subset of values. In fact, even if we impose the requirement thatDL(X,R, σ2) = O(σ2), it is still possible that the limit in (6.85) lies strictly betweenD(X) and D(X). For example, if X is Cantor distributed, it can be shown that thelimit in (6.85) approaches the information dimension d(X) = log3 2.

Proof of Theorem 49. Let R > γ. We show that the worst-case noise sensitivityζ∗(X,R) under Gaussian random sensing matrices is finite. We construct a subop-timal estimator based on the Lipschitz decoder in Theorem 32.4 Let An be a k × nGaussian sensing matrix and gAn the corresponding L(R)-Lipschitz decoder, suchthat k = Rn and P En = o(1) where En = gAn(AnX

n) 6= Xn denotes the errorevent. Without loss of generality, we assume that gAn(0) = 0. Fix τ > 0. Then

E[∥∥gAn(AnX

n + σNk)−Xn∥∥2]

≤ E[∥∥gAn(AnX

n + σNk)−Xn∥∥2

1Ecn]

+ 2L(R)2E[∥∥AnX

n + σNk∥∥2

1En]

+ 2E[∥∥Xn

∥∥21En

](6.94)

≤ kL(R)2σ2 + τn(2L(R)2 + 1)P En+ 2E[∥∥Xn

∥∥21‖Xn‖2>τn

]+ 2L(R)2E

[∥∥AnXn + σNk

∥∥21‖AnXn+σNk‖2>τn

] (6.95)

Dividing both sides of (6.95) by n and sending n→∞, we have: for any τ > 0,

lim supn→∞

1

nE[∥∥gAn(AnX

n + σNk)−Xn∥∥2]

≤ RL(R)2σ2 + 2 supn

1

nE[∥∥Xn

∥∥21‖Xn‖2>τn

]+ 2L(R)2 sup

n

1

nE[∥∥AnX

n + σNk∥∥2

1‖AnXn+σNk‖2>τn] . (6.96)

4Since we assume that varX = 1, the finite-entropy condition of Theorem 32 is satisfied auto-matically.

119

Page 131: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Since 1n‖Xn‖2 L2

−→1 and 1n

∥∥AnXn + σNk

∥∥2 L2

−→R(1 +σ2), which implies uniform inte-grability, the last two terms on the right-hand side of (6.96) vanish as τ →∞. Thiscompletes the proof of ζ∗(X,R) ≤ RL2(R).

120

Page 132: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Chapter 7

Suboptimality of practicalreconstruction algorithms

In this section we compare the optimal phase transition thresholds derived inChapters 5 and 6 to the suboptimal thresholds obtained by popular low-complexityalgorithms, analyzed in [15, 4, 6]. The material in this chapter has been presented inpart in [116].

We focus on the following three input distributions considered in [4, p. 18915],indexed by χ = ±,+ and respectively, which all belong to the family of inputdistributions of the mixture form in (1.4):

± : sparse signals (1.2);+ : sparse non-negative signals (1.2) with the absolutely continuous compo-

nent Pc supported on R+. : simple signals (1.3) with p = 1

2[15, Section 5.2, p. 540] where Pc is some

absolutely continuous distribution supported on the unit interval.

7.1 Noiseless compressed sensing

In the noiseless case, we consider the phase transition of error probability. Thephase transition thresholds for linear programming (LP) decoders and the approxi-mate message passing (AMP) decoder [4, 123] have been obtained in [3, 4]. Greedyreconstruction algorithms have been analyzed in [124], which only provide upperbounds (achievability results) on the transition threshold of measurement rate. Inthis section we focus our comparison on algorithms whose phase transition thresholdsare known exactly.

The following LP decoders are tailored to the three input distributions χ = ±,+and respectively (see Equations (P1), (LP) and (Feas) in [16, Section I]):

g±(y) = arg min‖x‖1 : x ∈ Rn,Ax = y, (7.1)

g+(y) = arg min‖x‖1 : x ∈ Rn+,Ax = y, (7.2)

g(y) = x : x ∈ [0, 1]n,Ax = y. (7.3)

121

Page 133: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

For sparse signals, (7.1) – (7.2) are based on `1-minimization (also known as Ba-sis Pursuit) [12], while for simple signals, the decoder (7.3) solves an LP feasibilityproblem. Note that the above decoders are universal in the sense that it requireno knowledge of the input distributions other than which subclass they belong. Ingeneral the decoders in (7.1) – (7.3) output a list of vectors upon receiving the mea-surement. The reconstruction is successful if and only if the output list containsonly the true vector. The error probability is thus defined as P gχ(AXn) 6= Xn,evaluated with respect to the product measure (PX)n × PA.

The phase transition thresholds of the reconstruction error probability for decoders(7.1) – (7.3) are derived in [3] using combinatorial geometry. For sparse signals and `1-minimization decoders (7.1) – (7.2), the expressions of the corresponding thresholdsR±(γ) and R+(γ) are quite involved, given implicitly in [3, Definition 2.3]. As observedin [4, Finding 1], R±(γ) and R+(γ) agree numerically with the solutions of followingequation, respectively:1

γ = maxz≥0

R− 2((1 + z2)Φ(−z)− zϕ(z))

1 + z2 − 2((1 + z2)Φ(−z)− zϕ(z)), (7.4)

γ = maxz≥0

R− ((1 + z2)Φ(−z)− zϕ(z))

1 + z2 − ((1 + z2)Φ(−z)− zϕ(z)), (7.5)

For simple signals, the phase transition threshold is proved to be [15, Theorem 1.1]

R(γ) =γ + 1

2. (7.6)

Moreover, substantial numerical evidence in [4, 123] suggests that the phrase tran-sition thresholds for the AMP decoder coincide with the LP thresholds for all threeinput distributions. The suboptimal thresholds obtained from (7.4) – (7.6) are plot-ted in Fig. 7.1 along with the optimal threshold obtained from Theorem 32 which isγ.2 In the gray area below the diagonal in the (γ,R)-phase diagram, any sequenceof sensing matrices and decompressors will fail to reconstruct the true signal withprobability that tends to one. Moreover, we observe that the LP or AMP decoder isseverely suboptimal unless γ is close to one.

In the highly sparse regime which is most relevant to compressed sensing problems,it follows from [16, Theorem 3] that for sparse signals (χ = ± and +),

Rχ(γ) = 2γ loge

1

γ(1 + o(1)), as γ → 0, (7.7)

1In the series of papers (e.g. [15, 16, 4, 6]), the phase diagram are parameterized by (ρ, δ), whereδ = R is the measurement rate and ρ = γ

R is the ratio between the sparsity and rate. In this thesis,the more straightforward parameterization (γ,R) is used instead. The ratio γ

Rχ(γ)is denoted by

ρ(γ;χ) in [4].2A similar comparison between the suboptimal threshold R±(γ) and the optimal threshold γ has

been provided in [10, Fig. 2(a)] based on replica-heuristic calculation.

122

Page 134: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

0.2 0.4 0.6 0.8 1.0

0.2

0.4

0.6

0.8

1.0

subo

ptim

alth

resh

old

±

+

Opt

imal

thre

shold

γ

R

Figure 7.1: Comparison of suboptimal thresholds (7.4) – (7.6) with optimal threshold.

which implies that Rχ has infinite slope at γ = 0. Therefore when γ 1, the `1

and AMP decoders require on the order of 2s logens

measurements to successfullyrecover the unknown s-sparse vector. In contrast, s measurements suffice when usingan optimal decoder (or `0-minimization decoder). The LP or AMP decoders are alsohighly suboptimal for simple signals, since R(γ) converges to 1

2instead of zero as γ →

0. This suboptimality is due to the fact that the LP feasibility decoder (7.3) simplyfinds any xn in the hypercube [0, 1]n that is compatible with the linear measurements.Such a decoding strategy does not enforce the typical discrete structure of the signal,since most of the entries saturate at 0 or 1 equiprobably. Alternatively, the followingdecoder achieves the optimal γ: define

T(xn) =1

n

n∑i=1

(1xi=0,1xi /∈0,1,1xi=1

).

The decoder outputs the solution to Axn = yk such that T(xn) is closest to(1−γ

2, γ, 1−γ

2

)(in total variation distance for example).

7.2 Noisy compressed sensing

In the noisy case, we consider the AMP decoder [6] and the `1-penalized least-squares (i.e. LASSO) decoder [13]:

gλLASSO(y) = argminx∈Rn

1

2‖y −Ax‖2 + λ ‖x‖1 , (7.8)

123

Page 135: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

where λ > 0 is a regularization parameter. For Gaussian sensing matrices and Gaus-sian observation noise, the asymptotic mean-square error achieved by LASSO for afixed λ

limn→∞

1

nE[∥∥Xn − gλLASSO(AXn + σNk)

∥∥2]

(7.9)

can be determined as a function of PX , λ and σ by applying [125, Theorem 1.4]with ψ(x, y) = (x − y)2. For sparse X distributed according to (1.4), the followingformal expressions for the worst-case (or asymptotic)3 noise sensitivity of LASSOwith optimized λ is proposed in [6, Proposition 3.1(1.a)] and verified by extensiveempirical results: for any PX of the mixture form (1.4),

ζLASSO(X,R) = ξLASSO(X,R) =

RR±(γ)R−R±(γ)

R > R±(γ)

∞ R ≤ R±(γ).(7.10)

where R±(γ) is given in (7.5). Furthermore, numerical evidence suggests that (7.10)also applies to the AMP algorithm [6]. In view of (7.10), the phase transition thresh-olds of noise sensitivity for the LASSO and AMP decoder are both R±(γ), whereas theoptimal threshold is γ as a consequence of Theorem 49. Therefore the phase tran-sition boundaries are identical to Fig. 7.1 and the same observation in Section 7.1applies. For sparse signals of the form (1.2) with γ = 0.1, Fig. 7.2 compares thoseexpressions for the asymptotic noise sensitivity of LASSO or AMP algorithms to theoptimal noise sensitivity predicted by Theorem 48 based on replica heuristics. Notethat the phase transition threshold of LASSO is approximately 3.3 times the optimal.Also, as the measurement rate R tends to infinity, the noise sensitivity of LASSOconverges to R±(γ) > 0 rather than zero, while the optimal noise sensitivity vanishesat the speed of 1

R.

3The worst-case and asymptotic noise sensitivity coincide since the worst-case noise variancetends to zero. This is because the least-favorable prior in [6, (2.5)] is supported on 0,±∞.

124

Page 136: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

0. 0.5 1. 1.5 2.0.1 0.33R0

1

2

3

4

Sensitivity

LASSO decoder

optimal decoder

Figure 7.2: Phase transition of the asymptotic noise sensitivity: sparse signal model(1.2) with γ = 0.1.

125

Page 137: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Chapter 8

Conclusions and future work

In this thesis we adopt a Shannon-theoretic framework for compressed sensingby modeling input signals as random processes rather than individual sequences.We investigate the phase transition thresholds (minimum measurement rate) of re-construction error probability (noiseless observations) and normalized MMSE (noisyobservations) achievable by optimal nonlinear, optimal linear, and random linear en-coders combined with the corresponding optimal decoders. For discrete-continuousinput distributions, which are most relevant for compressed sensing applications, theoptimal phase transition threshold is shown to be the information dimension of theinput, i.e., the weight of the analog part, regardless of the specific discrete and abso-lutely continuous component. In the memoryless case, this corresponds to the fractionof analog symbols in the source realization. This observation might suggest that thefundamental importance of the mixed discrete-continuous nature of the signal, ofwhich sparsity (or simplicity discussed in Section 1.3) is just one manifestation.

To conclude this chapter, we discuss several open problems in theoretical com-pressed sensing as future research directions.

8.1 Universality of sensing matrix ensemble

Random linear coding is not only an important achievability proof technique inShannon theory, but also an inspiration to obtain efficient schemes in modern codingtheory and compressed sensing.

In the noiseless case, Theorem 39 shows that for discrete-continuous mixtures, theinformation dimension γ can be achieved with Lebesgue-a.e. linear encoders. Thisimplies that random linear encoders drawn from any absolutely continuous distri-bution achieve the desired error probability almost surely. However, it is not clearwhether the optimal measurement rate γ can be obtained using discrete ensembles,e.g., Rademacher matrices.

In the noisy case, one of our main findings is that Gaussian sensing matricesachieve the same phase transition threshold as optimal nonlinear encoding. However,it should be noted that the argument used in the proof of Theorem 49 relies cruciallyon the Gaussianness of the sensing matrices because of two reasons:

126

Page 138: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

• The upper bound on the distribution function of the least singular value inLemma 23 is a direct consequence of the upper bound on its density due toEdelman [113], which only holds in the Gaussian case. In fact, we only needthat the exponent in (5.161) diverges as t → 0. It is possible to generalizethis result to other sub-Gaussian ensembles with densities by adapting thearguments in [126, Theorem 1.1]. However, it should be noted that in generalLemma 23 does not hold for discrete ensembles (e.g. Rademacher), because theleast singular value always has a mass at zero with a fixed exponent;

• Due to the rotational invariance of the Gaussian ensemble, the result inLemma 24 does not depend on the basis of the subspace.

The universal optimality of random sensing matrices with non-Gaussian i.i.d. entriesin terms of phase transition thresholds remains an open question question.

8.2 Near-optimal practical reconstruction algo-

rithms

From Chapter 7 we see that the phase transition threshold of state-of-the-artdecoding algorithms (e.g., LP-based and AMP decoders) still lie far from the optimalboundary. Designing low-complexity reconstruction schemes that attain the optimalthresholds remains an outstanding challenge. In particular, for mixture distributions(1.4), there are no known low-complexity algorithm whose phase transition thresholdhas finite slope at γ = 0.

In the case of sparse input distribution (1.2), it is interesting to notice that theoptimal phase transition boundary in Fig. 7.1 can be attained by `0-minimization,while the threshold of `1-minimization is known to be strictly suboptimal. A naturalquestion to ask is the performance of `q-minimization for 0 < q < 1, which has beenthe focus of very active research, e.g., [127, 128, 129, 130, 131]. It is possible that asq moves from one to zero, the phase-transition boundary of `q-minimization evolvescontinuously from the suboptimal `1-boundary to the optimal `0-boundary. How-ever, very little is known except the fact the the performance of `q-minimization doesnot deteriorate as q decreases [127]. Only upper bounds on the transition bound-ary are known [130, p. 11]. Note that as q decreases, the complexity of the `q-decoder increases as the optimization problem becomes more non-convex. Thereforethe knowledge about the precise threshold will allow us to strike a good balancebetween decoding performance and computational complexity among this family ofalgorithms.

8.3 Performance of optimal sensing matrices

The majority of the compressed sensing literature is devoted to either randomlygenerated measurement matrices or sufficient conditions on individual measurementmatrices that guarantee successful recovery of sparse vectors. Theorems 32 and 49indicate that when optimal decoders are used, random sensing matrices achieve the

127

Page 139: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

optimal phase transition threshold in both noiseless and noisy cases. However, withpragmatic suboptimal recovery algorithms, it might be possible to improve the per-formance by using optimized sensing matrices. Next we consider `p-minimizationdecoders with p ∈ [0, 1]. Given a measurement matrix A ∈ Rk×n, define the `p-minimization decoder as

gp(y) = arg minz:Az=y

‖z‖p . (8.1)

Now we define the following non-asymptotic fundamental limits of `p-minimizationdecoder by allowing to optimize over measurement matrices.

Definition 25. A k × n matrix A is called (s, n, p)-successful if using A as themeasurement matrix, the `p-minimization decoder can recover all s-sparse vectors onRn, i.e., gp A = id restricted on Σs,n, the collection of all s-sparse n-dimensionalvectors. Define k∗p(s, n) as the minimum number of measurements that guarantee thesuccess of `p-minimization decoder in recovering all s-sparse vectors, i.e., k∗p(s, n) =mink ∈ N : ∃A ∈ Rk×n that is (s, n, p)-successful.

From Remark 17 we know that

k∗0(s, n) = min2s, n. (8.2)

Also, it is trivial to show that for all p, k∗p(1, n) = min2, n. For the special case ofp = 1, [132, Proposition 2.3] characterizes a given matrix being (s, n, 1)-successful interms of its null space. However, determining k∗1(s, n) remains an open question. Inthe high-dimensional linearly sparse regime, we define

R∗`p(γ) = lim supn→∞

k∗p(bγnc, n)

n. (8.3)

Then R∗`p(γ) ∈ [0, 1] for all p and γ. From (8.2) we have

R∗`0(γ) = min2γ, 1, (8.4)

while R∗`p(γ) is not known for p 6= 0.

Remark 31. If one relaxes Definition 25 and only requires that all s-sparse positivevectors are successfully recovered, it can be shown ([133, Corollary 1.1] and [134])that `1-minimization attains the `0 performance in (8.2), achieved by a Fourier orVandermonde matrix.

8.4 Compressive signal processing

The current information system consists of three stages: data acquisition, com-pression and processing. Compressed sensing permeates compression in the datacollection process and achieves drastic reduction of data amount. Extending thisrationale, it is more efficient to combine all three stages by directly processing the

128

Page 140: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

information condensed from compressible signals. Fundamental limits of this moregeneral paradigm, namely, compressive signal processing, are largely unknown.

As a concrete example, consider the problem of support recovery from noisy ran-dom linear measurements, where the goal is not to recover the unknown sparse vectorwith high fidelity, but to recover its support faithfully. It has been shown that if theSNR is fixed and sensing matrices are i.i.d., reliable support recovery (vanishing blockerror rate) requires infinite rate, because number of necessary measurements scalesas Θ(n log n) [135, 136]. On the other hand, if we tolerate any positive fraction oferror (bit-error rate), finite rate can be achieved [9]. Note that the above conclusionsapply to random sensing matrices. The performance of optimal sensing matrices insupport recovery is a fundamental yet unexplored problem. It is yet to be determinedwhether asymptotic reliable support recovery is possible with finite SNR and finitemeasurement rate using optimized sensing matrices and decoders.

129

Page 141: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Appendix A

Proofs in Chapter 2

A.1 The equivalence of (2.7) to (2.4)

Proof. Let X be a random vector in Rn and denote its distribution by µ. Due tothe equivalence of `p-norms, it is sufficient to show (2.7) for p = ∞. Recalling thenotation for mesh cubes in (3.8), we have

H([X]m) = E[log

1

µ(Cm(2m[X]m))

]. (A.1)

For any 0 < ε < 1, there exists m ∈ N, such that 2−m ≤ ε < 2m−1. Then

Cm([x]m2m) ⊂ B∞(x, 2−m) ⊂ B∞(x, ε) (A.2)

⊂ B∞(x, 2−(m−1)) ⊂ B∞([x]m, 2−(m−2)). (A.3)

As a result of (A.2), we have

E[log

1

µ(B∞(X, ε))

]≤ H([X]m) (A.4)

On the other hand, note that B∞([x]m, 2−(m−2)) is a disjoint union of 8n mesh cubes.

By (A.3), we have

E[log

1

µ(B∞(X, ε))

]≥ E

[log

1

µ(B∞([X]m, 2−(m−2)))

](A.5)

≥ H([X]m)− n log 8. (A.6)

Combining (A.4) and (A.6) yields

H([X]m)− n log 8

m≤

E[log 1

µ(B∞(X,ε))

]log 1

ε

≤ H([X]m)

m− 1(A.7)

By Lemma 1, sending ε ↓ 0 and m→∞ yields (2.7).

130

Page 142: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

A.2 Proofs of Theorem 1

Lemma 28. For all p, q ∈ N,

H(〈Xn〉p) ≤ H(〈Xn〉q) + n log

(⌈p

q

⌉+ 1

). (A.8)

Proof. For notational conciseness, we only give the proof for the scalar case (n = 1).

H(〈X〉p) ≤ H(〈X〉p, 〈X〉q) (A.9)

= H(〈X〉p|〈X〉q) +H(〈X〉q). (A.10)

Note that for any l ∈ Z,

H

(〈X〉p

∣∣∣∣〈X〉q =l

q

)= H

(〈X〉p

∣∣∣∣ lq ≤ X

q<l + 1

q

)(A.11)

= H

(bpXc

∣∣∣∣plq ≤ pX <p(l + 1)

q

). (A.12)

Given pX ∈[plq, p(l+1)

q

), the range of bpXc is upper bounded by

⌈pq

⌉+ 1. Therefore

for all l ∈ Z,

H

(〈X〉p

∣∣∣∣〈X〉q =l

q

)≤ log

(⌈p

q

⌉+ 1

). (A.13)

Hence H(〈X〉p|〈X〉q) admits the same upper bound and (A.8) holds.

Lemma 29 ([137, p. 2102]). Let W be an N-valued random variable. Then H(W ) <∞ if E [logW ] <∞.

Proof of Theorem 1. Using Lemma 28 with p = m, q = 1 and p = 1, q = m, we have

H(bXnc)− log 2 ≤ H(〈Xn〉m) ≤ H(bXnc) + logm. (A.14)

2) ⇒ 1): When H(bXnc) is finite, dividing both sides of (A.14) by logm andletting m→∞ results in 1).

1) ⇒ 2): Suppose H(bXnc) = ∞. By (A.14), H(〈Xn〉m) = ∞ for every , and 1)fails. This also proves (2.14).

3)⇔ 2): By the equivalence of norms on Rn, we shall consider the `∞ ball. Define

g(Xn, ε) , E[log

1

PXn(B∞(Xn, ε))

], (A.15)

which is decreasing in ε. Let 2−m ≤ ε < 2−m+1, where m ∈ N. By (A.4) and (A.6)(or [23, Lemma 2.3]), we have

g(Xn, 2−m) ≤ H([Xn]m) ≤ g(Xn, 2−m) + n log 8. (A.16)

131

Page 143: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

The equivalence between 3) and 2) then follows from (A.14) and (A.16).3) ⇔ 4): By Lemma 32 in Appendix A.5.4) ⇔ 5): For all snr > snr′ > 0

I(snr) = I(snr′) +1

2

∫ snr

snr′mmse(X, γ)dγ (A.17)

≤ I(snr′) +1

2log

snr

snr′, (A.18)

where (A.17) and (A.18) are due to (4.58) and (4.8) respectively.(2.12) ⇒ 2):

E [log(|X|+ 1)] <∞ (A.19)

⇒ E [log(b|X|c+ 1)] <∞ (A.20)

⇒ H(b|X|c+ 1) <∞ (A.21)

⇒ H(b|X|c) <∞ (A.22)

⇒ H(bXc) <∞ (A.23)

where

• (A.21): by Lemma 29.• (A.23): by

H(bXc) ≤H(b|X|c) +H (bXc|b|X|c) (A.24)

≤H(b|X|c) + log 2. (A.25)

A.3 Proofs of Lemmas 1 – 2

Proof of Lemma 1. Fix any m ∈ N and l ∈ N, such that 2l−1 ≤ m < 2l. ByLemma 28, we have

H([X]l−1) ≤ H(〈X〉m) + log 3, (A.26)

H(〈X〉m) ≤ H([X]l) + log 3. (A.27)

ThereforeH([X]l−1)− log 3

l≤ H(〈X〉m)

logm≤ H([X]l) + log 3

l − 1(A.28)

and hence (2.16) and (2.17) follow.

132

Page 144: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Proof of Lemma 2. Note that

H(〈X〉m) ≤H(dmXem

)+H

(〈X〉m

∣∣∣∣dmXem

)(A.29)

≤H(dmXem

)+ log 2, (A.30)

H

(dmXem

)≤H (〈X〉m) +H

(dmXem

∣∣∣∣ 〈X〉m) (A.31)

≤H (〈X〉m) + log 2. (A.32)

The same bound holds for rounding.

A.4 Proof of Lemma 3

First we present several auxiliary results regarding entropy of quantized randomvariables:

Lemma 30. Let Un and V n be integer-valued random vectors. If −Bi ≤ Ui−Vi ≤ Aialmost surely, then

|H(Un)−H(V n)| ≤n∑i=1

log(1 + Ai +Bi). (A.33)

Proof. H(Un) ≤ H(V n) +H(Un − V n|V n) ≤ H(V n) +∑n

i=1 log(1 + Ai +Bi).

The next lemma shows that the entropy of quantized linear combinations is closeto the entropy of linear combinations of quantized random variables, as long as thecoefficients are integers. The proof is a simple application of Lemma 30.

Lemma 31. Let [·]m be defined in (2.15). For any aK ∈ ZK and any Xn1 , . . . , X

nK,∣∣∣∣∣H

(K∑j=1

aj[Xni ]m

)−H

([K∑j=1

ajXni

]m

)∣∣∣∣∣ ≤ n log

(2 + 2

K∑j=1

|aj|

). (A.34)

Proof of Lemma 3. The translation and scale-invariance of d(·) are immediate conse-quence of the alternative definition of information dimension in (2.8). To prove (2.20)and (2.21), we apply Lemma 31 to obtain

|H([X + Y ]m)−H([X]m + [Y ]m)| ≤ log 6. (A.35)

and note that

maxH([X]m), H([Y ]m) ≤ H([X]m + [Y ]m) ≤ H([X]m) +H([Y ]m). (A.36)

133

Page 145: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Since (2.22) is a simple consequence of independence, it remains to show (2.23).Let U, V,W be integer valued. By the data processing theorem for mutual informa-tion, we have I(V ;U + V +W ) ≤ I(V ;V +W ), i.e.,

H(U + V +W )−H(V +W ) ≤ H(U +W )−H(W ). (A.37)

Replacing U, V,W by [X]m, [Y ]m, [Z]m respectively and applying Lemma 31, we have

H([X + Y + Z]m)−H([Y + Z]m) ≤ H([X + Z]m)−H([Z]m) + 9 log 2. (A.38)

Dividing both sides by m and sending m to ∞ yield (2.23).

A.5 A non-asymptotic refinement of Theorem 9

If H(bXnc) = ∞, then, in view of Theorem 1, we have I(Xn, snr) = ∞ andd(Xn) = ∞, in which case Theorem 9 holds automatically. Next we assume thatH(bXnc) < ∞, then I(Xn, snr) and g(Xn, ε) are both finite for all snr and ε, whereg(Xn, ε) is defined in (A.15). Next we prove the following non-asymptotic result,which, in view of the alternative definition of information dimension in (2.7), impliesTheorem 9.

Lemma 32. Let H(bXnc) <∞. For any snr > 0,

n

2log

6

eπ≤ I(Xn, snr)− g

(Xn, 2

√3snr−1

)≤ n log 3, (A.39)

where g(Xn, ·) is defined in (A.15).

Proof. First consider the scalar case. Let U be uniformly distributed on [−√

3,√

3],which has zero mean and unit variance. Then [138]

I(X, snr) ≤ I(X;√snrX + U) ≤ I(X, snr) +

1

2log

πe

6, (A.40)

where 12

log πe6

= D(PU || N (0, 1)) is the non-Gaussianness of U .Next we provide alternative lower and upper bounds to I(X;

√snrX + U):

(Lower bound) Let ε = 1√snr

. Abbreviate the `∞ ball B∞(x, ε) by B(x, ε). Thenthe density of Y = X + εU is given by

pY (y) =PX(B(y,

√3ε))

2√

3ε. (A.41)

134

Page 146: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Then

I(X;√snrX + U) = D(PY |X ||PY |PX) (A.42)

= E[log

1

PX(B(X + εU,√

3ε))

](A.43)

≥ E[log

1

PX(B(X, 2√

3ε))

], (A.44)

where (A.44) is due to B(X + εU,√

3ε) ⊂ B(X, 2√

3ε), which follows from |U | ≤√

3and the triangle inequality.

(Upper bound) Let QY denote the distribution of X + 3εU , whose density is

qY (y) =PX(B(y, 3

√3ε))

6√

3ε. (A.45)

Then

I(X;√snrX + U) ≤ D(PY |X ||QY |PX) (A.46)

= E[log

3

PX(B(X + εU, 3√

3ε))

](A.47)

≤ E[log

1

PX(B(X, 2√

3ε))

]+ log 3, , (A.48)

where (A.48) is due to B(X + εU, 3√

3ε) ⊃ B(X, 2√

3ε).Assembling (A.40), (A.44) and (A.48) yields (A.39). The proof for the vector case

follows from entirely analogous fashion.

A.6 Proof of Lemma 5

Proof. Let U be an open interval (a, b). Then Fj(U) = (wj + ra, wj + rb) and⋃j

Fj(U) ⊂ U ⇔ minjwj + ra ≥ a and max

jwj + rb ≤ b. (A.49)

and for i 6= j,Fi(U) ∩ Fj(U) = ∅ ⇔ |wi − wj| ≥ r(b− a). (A.50)

Therefore, there exist a < b that satisfy (A.49) and (A.50) if and only if (2.82)holds.

135

Page 147: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

A.7 Proof of Lemma 6

Proof. Let X be of the form in (2.80). Let M = max1≤j≤mwj and mk =⌊r−k⌋

wherek ∈ N. Then |X| ≤ rM

1−r . By Lemma 1,

dα(X) = limk→∞

Hα(bmkXc)k log 1

r

. (A.51)

Since ∣∣∣∣∣bmkXc −k−1∑i=0

r−iWk−i

∣∣∣∣∣ ≤ 2rM

1− r, (A.52)

applying Lemma 30 to (A.52) and substituting into (A.51) yield

dα(X) ≤ limk→∞

Hα(∑k−1

i=0 r−iWk−i)

k log 1r

≤ Hα(P )

log 1r

, (A.53)

where we have used Hα(∑k−1

i=0 r−iWk−i) ≤ Hα(W k) = kHα(P ).

A.8 Proof of Theorem 12

First we present several auxiliary results from sumset theory for entropy.

Lemma 33 (Submodularity [139, Lemma A.2]). If X0, X1, X2, X12 are random vari-ables such that X0 = f(X1) = g(X2) and X12 = h(X1, X2), where f, g, h are deter-ministic functions. Then

H(X0) +H(X12) ≤ H(X1) +H(X2). (A.54)

In the sequel, G denotes an arbitrary abelian group.

Lemma 34 (Ruzsa’s triangle inequality [139, Theorem 1.7]). Let X and Y be G-valued random variables. Define the Ruzsa distance1 as

∆(X, Y ) , H(X ′ − Y ′)− 1

2H(X ′)− 1

2H(Y ′), (A.55)

where X ′ and Y ′ are independent copies of X and Y . Then

∆(X,Z) ≤ ∆(X, Y ) + ∆(Y, Z), (A.56)

that is,H(X − Z) ≤ H(X − Y ) +H(Y − Z)−H(Y ). (A.57)

1∆(·, ·) is not a metric since ∆(X,X) > 0 unless X is deterministic.

136

Page 148: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

The following entropic analogue of Plunnecke-Ruzsa’s inequality [140] is implicitlyimplied in [141, Proposition 1.3], as pointed out by Tao [142, p. 203].

Lemma 35. Let X, Y1, . . . , Ym be independent G-valued random variables. Then

H(X + Y1 + . . .+ Ym) ≤ H(X) +m∑i=1

[H(X + Yi)−H(X)]. (A.58)

The following entropy inequality for group-valued random variables is based onresults from [143, 144, 145], whose proof can be found in [146, Section II-E].

Lemma 36.1

2≤ ∆(X,X)

∆(X,−X)≤ 2. (A.59)

Lemma 37 (Sum-difference inequality [139, (2.2)]). Let X and Y be independentG-valued random variables. Then ∆(X, Y ) ≤ 3∆(X,−Y ), i.e.,

H(X − Y ) ≤ 3H(X + Y )−H(X)−H(Y ). (A.60)

The foregoing entropy inequalities deal with linear combinations of independentrandom variables with coefficients being ±1. In Theorem 50 we present a new upperbound on the entropy of the linear combination of independent random variableswith general integer coefficients. The following two lemmas lie at the core of theproof, which allow us to reduce the case from p to either p ± 1 (additively) or p

2

(multiplicatively).

Lemma 38 (Additive descent). Let X, Y and Z be independent G-valued randomvariables. Let p, r ∈ Z\0. Then

H(pX + Y ) ≤ H((p− r)X + rZ + Y ) +H(X − Z)−H(Z). (A.61)

Proof. Applying Lemma 33 with X1 = ((p − r)X + rZ + Y,X − Z), X2 = (X, Y ),X0 = pX + Y and X12 = (X, Y, Z) yields (A.61).

Lemma 39 (Multiplicative descent). Let X,X ′ and Y be independent G-valued ran-dom variables where X ′ has the same distribution as X. Let p ∈ Z\0. Then

H(pX + Y ) ≤ H(p

2X + Y

)+H(2X −X ′)−H(X). (A.62)

137

Page 149: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Proof. By (A.57), we have

H(pX + Y ) ≤ H(p

2X + Y

)+H

(pX − p

2X ′)−H

(p2X). (A.63)

= H(p

2X + Y

)+H(2X ′ −X)−H(X). (A.64)

Theorem 50. Let X and Y be independent G-valued random variables. Let p, q ∈Z\0. Then

H(pX + qY )−H(X + Y )

≤ 9(lp + lq)H(X + Y )− (5lp + 4lq)H(Y )− (5lq + 4lp)H(X), (A.65)

where lp , 1 + blog |p|c.

It is interesting to compare Theorem 50 and its proof techniques to its additive-combinatorial counterpart [147, Theorem 1.3], which gives an upper bound on thecardinality of the sum of dilated subsets.

Proof of Theorem 50. Let Xk and Yk are sequences of independent copies of Xand Y respectively. By (A.57), we have

H(pX + qY ) ≤ H(pX −X1) +H(X + qY )−H(X) (A.66)

≤ H(pX + Y ) +H(qY +X) +H(X + Y )−H(X)−H(Y ). (A.67)

Next we upper bound H(pX + Y ). First we assume that p is positive. If p is even,applying (A.62) yields

H(pX + Y ) ≤ H(p

2X + Y

)+H(2X −X1)−H(X). (A.68)

If p is odd, applying (A.61) with r = 1 then applying (A.62), we have

H(pX + Y )

≤ H((p− 1)X + Y1 + Y ) +H(X − Y )−H(Y ) (A.69)

≤ H

(p− 1

2X + Y1 + Y

)+H(X − Y )−H(Y ) +H(2X −X1)−H(X) (A.70)

≤ H

(p− 1

2X + Y1 + Y

)+ 2H(X − Y ) + 2H(X + Y )− 3H(Y )−H(X), (A.71)

where the last equality holds because

H(2X −X1) ≤ H(X + Y ) +H(X −X1 − Y )−H(Y ) (A.72)

≤ 2H(X + Y ) +H(X − Y )− 2H(Y ), (A.73)

where (A.72) and (A.73) are due to (A.61) and Lemma 35 respectively.

138

Page 150: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Combining (A.68) and (A.71) gives the following upper bound

H(pX + Y )

≤ H(⌊p

2

⌋X + Y1 + Y

)+ 2H(X − Y ) + 2H(X + Y )− 3H(Y )−H(X) (A.74)

≤ H(⌊p

2

⌋X + Y1 + Y

)+ 8H(X + Y )− 5H(Y )− 3H(X), (A.75)

where (A.75) is due to Lemma 37. Repeating the above procedure with p replacedby⌊p2

⌋, we have

H(pX + Y ) ≤ H

(X +

lp∑i=1

Yi

)+ lp(8H(X + Y )− 5H(Y )− 3H(X)) (A.76)

≤ H(X) + lp(9H(X + Y )− 5H(Y )− 4H(X)), (A.77)

where (A.77) is due to Lemma 35. For negative p, the above proof remains valid ifwe apply (A.61) with r = −1 instead of 1. Then (A.77) holds with p replaced by −p.Similarly, we have

H(qY +X) ≤ H(Y ) + lq(9H(X + Y )− 5H(X)− 4H(Y )). (A.78)

Assembling (A.77) – (A.78) with (A.67), we obtain (A.65).

Finally, we prove Theorem 12 by applying Theorem 50.

Proof. First we assume that p = p′ = 1. Substituting [X]m and [Y ]m into (A.65) andapplying Lemma 30, we have

H(p[X]m + q[Y ]m)−H([X]m + [Y ]m)

≤ 9(lp + lq)H([X]m + [Y ]m)− (5lp + 4lq)H([Y ]m)− (5lq + 4lp)H([X]m) +O(1).(A.79)

Dividing both sides by m and sending m→∞ yield

d(pX + qY )− d(X + Y )

≤ 9(lp + lq)d(X + Y )− (5lp + 4lq)d(Y )− (5lq + 4lp)d(X). (A.80)

Then

18(lp + lq)[d(pX + qY )− d(X + Y )]

= [(5lp + 4lq)d(Y )− (5lq + 4lp)d(X)− 9(lp + lq)d(X + Y )]

+ (5lp + 4lq)[d(pX + qY )− d(Y )− d(X + Y )]

+ (5lq + 4lp)[d(pX + qY )− d(X)− d(X + Y )] + 9(lp + lq)d(pX + qY ) (A.81)

≤ − [d(pX + qY )− d(X + Y )] + 9(lp + lq)d(pX + qY ), (A.82)

139

Page 151: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

where (A.82) is due to (A.80), (2.21) and (2.11). Therefore

d(pX + qY )− d(X + Y ) ≤ 9(lp + lq)

18(lp + lq) + 1=

1

2− 1

36(lp + lq) + 2, (A.83)

which is the desired (2.96).Finally, by the scale-invariance of information dimension, setting Z = Y

pp′gives

d(pX + qY )− d(p′X + q′Y ) = d(X + p′qZ)− d(X + pq′Z) (A.84)

≤ 1

2− ε(p′q, pq′). (A.85)

Remark 32. For the special case of p = 2 and q = 1, we proceed as follows:

H(2X + Y ) ≤ ∆(X,X) +H(X +X ′ + Y ) (A.86)

≤ ∆(X,X) + 2H(X + Y )−H(Y ). (A.87)

≤ 4H(X + Y )− 2H(Y )−H(X). (A.88)

where

• (A.86): by (A.61);• (A.87): by Lemma 35;• (A.88): by triangle inequality ∆(X,X) ≤ 2∆(X,−Y ).

By (A.88), we have d(2X + Y )− d(X + Y ) ≤ 3d(X + Y )− d(X)− 2d(Y ). Con-sequently, the same manipulation in (A.81) – (A.83) gives

d(2X + Y )− d(X + Y ) ≤ 3

7. (A.89)

Similarly for p = 1 and q = −1, from Lemma 37, we have:

d(X − Y ) ≤ 3d(X + Y )− d(X)− d(Y ), (A.90)

which implies

d(X − Y )− d(X + Y ) ≤ 2

5. (A.91)

A.9 Proof of Theorem 13

The following basic result from the theory of rational approximation and continuedfractions ([148, 149]) is instrumental in proving Theorem 13. For every irrational x,there exists a unique x0 ∈ Z and a unique sequence xk ⊂ N, such that x is given

140

Page 152: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

by the infinite continued fraction

x = x0 +1

x1 +1

x2 +.. .

(A.92)

The rational numberpk(x)

qk(x), x0 +

1

x1 +1

x2 +1

. . . +1

xk

, (A.93)

is called the kth-convergent of x. Convergents provide the best rational approximationto irrationals [149].

Lemma 40. Let x be an irrational real number and N a positive integer. Define

EN(x) =

(p, q) ∈ Z2 : |p− qx| ≤ 1

2q, |p|, |q| ≤ N

. (A.94)

Then|EN(x)| ≤ 1 + 2 logN. (A.95)

Proof. By [149, Theorem 19, p. 30], pq

is necessarily a convergent of x. By [149,

Theorem 12, p. 13], for any irrational x, the sequence qk(x) formed by denominators

of successive convergents grows exponentially: qk(x) ≥ 2k−12 for all k ≥ 1,2 which

implies the desired (A.95).

The next lemma shows that convolutions of homogeneous self-similar measureswith common similarity ratio are also self-similar (see also [48, p. 1341] and the refer-ences therein), but with possible overlaps which might violate the open set conditions.Therefore the formula in (2.85) does not necessarily apply to convolutions.

Lemma 41. Let X and X ′ be independently distributed according to µ and µ′ re-spectively, which are homogeneous self-similar with identical similarity ratio r. Thenfor any a, a′ ∈ R, the distribution of aX + a′X ′ is also homogeneous self-similar withsimilarity ratio r.

Proof. According to Remark 3, X and X ′ satisfy XD= rX + W and X ′

D= rX ′ + W ′,

respectively, where W and W ′ are both finitely valued and X,W,X ′,W ′ are inde-

pendent. Therefore aX+a′X ′D= r(aX+a′X ′)+aW +a′W ′, which implies the desired

conclusion.

2In fact, limn→∞ qn(x)1n = γ for Lebesgue a.e. x, where γ = e

π2 log e12 is the Khinchin-Levy

constant [149, p. 66].

141

Page 153: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Proof of Theorem 13. Let λ , p′qpq′

be irrational. In view of the scale-invariance of

information dimension, to establish (2.98), it is equivalent to show

supX⊥⊥Y

d(X + λY )− d(X + Y ) =1

2. (A.96)

Next we construct a sequence of distributions µN such that for X and Y are inde-pendent and both distributed according to µN , we have

d(X + λY )− d(X + Y ) ≥ 1

2−O

(log logN

logN

), N →∞, (A.97)

which, in view of (2.95), implies the desired (A.96).Let N ≥ 2 be an integer. Denote by µN the invariant measure generated by

the IFS (2.73) with rj = 1N2 , wj = N(j − 1) and pj = 1

N, j = 1, . . . , N . Then µN

is homogeneous self-similar and satisfies the open set condition (in fact the strongseparation condition), in view of Lemma 5. By Theorem 11, d(µN) = 1

2for all N .

Alternatively, µN can be defined by the N -ary expansions as follows: Let X bedistributed according to µN . Then the even digits in the N -ary expansion of X areindependent and equiprobable, while the odd digits are frozen to be zero. In viewof the discussion in Section 2.5, the information dimension of µN is the normalizedentropy rate of the digits, which is equal to 1

2.

Let X and Y be independently distributed according to µN . Then

X =∑i≥1

XiN−2i, (A.98)

Y =∑i≥1

YiN−2i, (A.99)

where Xi and Yi are both i.i.d. and equiprobable on N0, . . . , N − 1. ThenX +Y also has a homogeneous self-similar with similarity ratio N−2 and satisfies theopen set condition. Applying Theorem 11 yields

d(X + Y ) =H(X1 + Y1)

2 logN≤ log(2N − 1)

2 logN. (A.100)

Next we prove that d(X + λY ) is close to one. By Lemma 41, X + λY is self-similarhomogeneous with similarity ratio 1

N2 . By Theorem 10, d(X + λY ) exists. Moreover,

d(X + λY ) ≥ H(〈X + λY 〉N2)− 10

2 logN. (A.101)

142

Page 154: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Denote X ′ = 〈X〉N2 and Y ′ = 〈Y 〉N2 . Let N ′ , 4N2. To further lower bound theentropy in (A.101), note that

H(〈X + λY 〉N2) ≥ H(〈X + λY 〉N ′)− log 5 (A.102)

≥ H(〈X ′ + λY ′〉N ′)− log 5− log(3 + 2|λ|) (A.103)

= H(X ′) +H(Y ′)−H(X ′, Y ′|〈X ′ + λY ′〉N ′)− log 5− log(3 + 2|λ|), (A.104)

where (A.102), (A.103) and (A.104) follow from Lemma 28, Lemma 30 and the inde-pendence of X ′ and Y ′, respectively.

Next we upper bound the conditional entropy in (A.104) using Lemma 40. Bythe definition of µN , X ′ is valued in 1

N0, . . . , N − 1. Fix k ∈ Z. Conditioned on

the event〈X ′ + λY ′〉N ′ = k

N ′

, define the set of all possible realizations of (X ′, Y ′):

Ck =( x

N,y

N

):

∣∣∣∣ xN + λy

N− k

N ′

∣∣∣∣ < 1

N ′, x, y ∈ Z+, x < N, y < N

. (A.105)

To upper bound |Ck|, note that if ( xN, yN

) and (u′

N, v′

N) both belong to Ck, then∣∣∣∣x− x′N

+ λy − y′

N

∣∣∣∣ ≤ 2

N ′, (A.106)

which, in view of |y − y′| < N and N ′ = 4N2, implies that

|x− x′ + λ(y − y′)| ≤ 2N

N ′≤ 1

2|y − y′|. (A.107)

In view of (A.94), (A.107) implies that (x− x′, y − y′) ∈ EN(λ). In other words,

Ck − Ck ⊂1

NEN(λ). (A.108)

By assumption, λ is irrational. Consequently, by Lemma 40,

|Ck| ≤ |EN(λ)| ≤ 1 + 2 logN. (A.109)

Since (A.109) holds for all k, we have

H(X ′, Y ′|〈X ′ + λY ′〉N ′) ≤ log(1 + 2 logN). (A.110)

Plugging (A.110) into (A.104) yields

H(〈X + λY 〉N2) ≥ H(〈X〉N2) +H(〈Y 〉N2)−O(log logN). (A.111)

143

Page 155: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Dividing both sides of (A.111) by 2 logN and substituting into (A.101), we have

d(X + λY ) ≥ H(〈X〉N2)

2 logN+H(〈Y 〉N2)

2 logN−O

(log logN

logN

)(A.112)

= 1−O(

log logN

logN

), (A.113)

where we have used H(〈X〉N2) = H(〈Y 〉N2) = logN . Combining (A.100) and(A.113), we obtain the desired (A.96), hence completing the proof of Theorem 13.

144

Page 156: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Appendix B

Proofs in Chapter 4

B.1 Proof of Theorem 19

Proof. Invariance of the MMSE functional under translation is obvious. Hence forany α, β 6= 0,

mmse(αX + γ, βN + η, snr)

= mmse(αX, βN, snr) (B.1)

= mmse(αX|α√snrX + βN) (B.2)

= |α|2mmse

(X∣∣∣ |α||β|√snrX + sgn(αβ)N

)(B.3)

= |α|2mmse

(X, sgn(αβ)N,

|α||β|√snr

)(B.4)

= |α|2mmse

(sgn(αβ)X,N,

|α||β|√snr

). (B.5)

Therefore

D(αX + γ, βN + η)

= lim infsnr→∞

snr ·mmse(αX + γ, βN + η, snr)

var(βN + η)(B.6)

= lim infsnr→∞

snr · |α|2mmse(X, sgn(αβ)N, |α||β|

√snr)

|β|2varN(B.7)

= D(X, sgn(αβ)N) (B.8)

= D(sgn(αβ)X,N), (B.9)

where (B.8) and (B.9) follow from (B.4) and (B.5) respectively. The claims in Theo-rem 19 are special cases of (B.8) and (B.9). The proof for D follows analogously.

145

Page 157: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

B.2 Calculation of (4.17), (4.18) and (4.93)

In this appendix we compute mmse(X,N, snr) for three different pairs of (X,N).First we show (4.18), where X is uniformly distributed in [0, 1] and N has the densityin (4.16) with α = 3. Let ε = 1√

snr. Then E[X|X + εN = y] = q1

q0(y), where

q0(y) =1

εE[fN

(y − xε

)](B.10)

=

1− ε2

y2ε < y < 1 + ε

ε2(−1+2y)(−1+y)2y2

y > 1 + ε.(B.11)

and

q1(y) =1

εE[XfN

(y − xε

)](B.12)

=

32√y

+ y − 3ε ε < y < 1 + ε

2ε32√y

+ (3−2y)ε32

(y−1)32

y > 1 + ε.(B.13)

Then

mmse(X,N, snr)

= E[X2]− E[(E[X|X + εN ])2

](B.14)

= E[X2]−∫ ∞

0

q21(y)

q0(y)dy (B.15)

= 2ε2[log

(1 +

1

)− 2 + 8ε coth−1(1 + 4ε)

], (B.16)

where we have used E [X2] = 13. Taking Taylor expansion on (B.16) at ε = 0 yields

(4.18). For α = 2, (4.17) can be shown in similar fashion.To show (4.93), where X and N are both uniformly distributed in [0, 1], we note

that

q0(y) =1

ε[miny, 1 − (y − ε)+] (B.17)

q1(y) =1

2ε[miny2, 1 − ((y − ε)+)2], (B.18)

where (x)+ , maxx, 0. Then (4.93) can be obtained using (B.15).

146

Page 158: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

B.3 Proof of Theorem 21

The outline of the proof is as follows. From

mmse(X, snr|U) ≤ mmse(X, snr), (B.19)

we immediately obtain the inequalities in (4.61) and (4.62). Next we prove thatequalities hold if U is discrete. Let U = ui : i = 1, . . . , n denote the alphabet ofU with n ∈ N ∪ ∞, αi = P U = ui. Denote by µi the distribution of X givenU = ui. Then the distribution of X is given by the following mixture:

µ =n∑i=1

αiµi. (B.20)

Our goal is to establish

D(µ) =n∑i=1

αiD(µi), (B.21)

D(µ) =n∑i=1

αiD(µi). (B.22)

After recalling an important lemma due to Doob in Appendix B.3.1, we decomposethe proof of (B.21) and (B.22) into four steps, which are presented in Appendices B.3.2– B.3.5 respectively:

1. We prove the special case of n = 2 and µ1 ⊥ µ2;

2. We consider µ1 µ2;

3. Via the Hahn-Lebesgue decomposition and induction on n, the conclusion isextended to any finite mixture;

4. We prove the most general case of countable mixture (n =∞).

B.3.1 Doob’s relative limit theorem

The following lemma is a combination of [35, Exercise 2.9, p.243] and [150, The-orem 1.6.2, p.40]:

Lemma 42. Let µ and ν be two Radon measures on Rn. Define the density of µ withrespect to ν by

Dν(x) = lim

ε↓0

µ(B(x, ε))

ν(B(x, ε)), (B.23)

where B(x, ε) denotes the open ball of radius ε centered at x. If µ ⊥ ν, then

Dν= 0, ν-a.e. (B.24)

147

Page 159: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

If µ ν, thenDµ

Dν=

dν, µ-a.e. (B.25)

The Lebesgue-Besicovitch differentiation theorem is a direct consequence ofLemma 42:

Lemma 43 ([150, Corollary 1.7.1]). Let ν be a Radon measure on Rn and g ∈L1

loc(Rn, ν). Then

limε↓0

1

ν(B(x, ε))

∫B(x,ε)

|g(y)− g(x)|ν(dy) = 0 (B.26)

holds for ν-a.e. x ∈ Rn.

It is instructive to reformulate Lemma 42 in a probabilistic context: Suppose µ andν are probability measures and X and Z are random variables distributed accordingto µ and ν respectively. Let N be uniformly distributed in [−1, 1] and independentof X,Z. Then X + εN has the following density:

fX+εN(x) =1

2εµ(B(x, ε)), (B.27)

hence the density of µ with respect to ν can be written as

Dν(x) = lim

ε↓0

fX+εN(x)

fZ+εN(x). (B.28)

A natural question is whether (B.28) still holds if N has a non-uniform distribution.In [151, Theorem 4.1], Doob gave a sufficient condition for this to be true, which issatisfied in particular by Gaussian-distributed N [151, Theorem 5.2]:

Lemma 44. For any z ∈ R, let ϕz(·) = ϕ(z+·). Under the assumption of Lemma 42,if ∫

ϕz

(y − xε

)µ(dy),

∫ϕz

(y − xε

)ν(dy) (B.29)

are finite for all ε > 0 and x ∈ R, then

Dν(x) = lim

ε↓0

∫ϕz(y−xε

)µ(dy)∫

ϕz(y−xε

)ν(dy)

(B.30)

holds for ν-a.e. x.

Consequently we have counterparts of Lemmas 42 and 43:

Lemma 45. Under the condition of Lemma 42, if µ ⊥ ν, then

limε↓0

∫ϕz(y−xε

)µ(dy)∫

ϕz(y−xε

)ν(dy)

= 0, ν-a.e. (B.31)

148

Page 160: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

If µ ν, then

limε↓0

∫ϕz(y−xε

)µ(dy)∫

ϕz(y−xε

)ν(dy)

=dµ

dν, µ-a.e. (B.32)

Lemma 46. Under the condition of Lemma 43,

limε↓0

∫|g(y)− g(x)|ϕz

(y−xε

)ν(dy)∫

ϕz(y−xε

)ν(dy)

= 0 (B.33)

holds for ν-a.e. x.

B.3.2 Mixture of two mutually singular measures

We first present a lemma which enables us to truncate the input or the noise. Thepoint of this result is that the error term depends only on the truncation thresholdbut not on the observation.

Lemma 47. For K > 0 define XK = X1|X|≤K and XK = X − XK. Then for allY ,

|mmse(XK |Y )−mmse(X|Y )| ≤ 3∥∥XK

∥∥2‖X‖2 . (B.34)

Proof.

mmse(XK |Y ) = ‖XK − E[XK |Y ]‖22 (B.35)

≤(‖X − E[X|Y ]‖2 +

∥∥XK − E[XK |Y ]∥∥

2

)2(B.36)

≤(mmse(X|Y )

12 +

√varXK

)2

(B.37)

= mmse(X|Y ) + varXK + 2√varXK mmse(X|Y ) (B.38)

≤ mmse(X|Y ) + 3∥∥XK

∥∥2‖X‖2 . (B.39)

The other direction of (B.34) follows in entirely analogous fashion.

Let Xi be a random variable with distribution µi, for i = 1, 2. Let U be a randomvariable independent of X1, X2, taking values on 1, 2 with probability 0 < α1 < 1and α2 = 1− α2 respectively. Then the distribution of XU is

µ = α1µ1 + α2µ2. (B.40)

Fixing M > 0, we define

gu,ε(y) = E[ϕ

(y −Xu

ε

)], (B.41)

fu,ε(y) = E[y −Xu

εϕ

(y −Xu

ε

)1| y−Xuε |≤M

](B.42)

149

Page 161: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

and

gε = α1g1,ε + α2g2,ε, (B.43)

fε = α1f1,ε + α2f2,ε. (B.44)

Then the densities of Yu = Xu + εNG and YU = XU + εNG are respectively given by

qu,ε(y) =1

εgu,ε(y) (B.45)

qε(y) =1

εgε(y). (B.46)

We want to show

mmse(XU , snr)−mmse(XU , snr|U) = o

(1

snr

), (B.47)

i.e., the benefit of knowing the true distribution is merely o(

1snr

)in the high-SNR

regime. To this end, let ε = 1√snr

and define

W = NG1|NG|≤M, (B.48)

AM(ε) = mmse(W |YU)−mmse(W |YU , U) ≥ 0. (B.49)

By the orthogonality principle,

AM(ε) = E[(W − W (YU))2]− E[(W − W (YU , U))2] (B.50)

= E[(W (YU)− W (YU , U))2] (B.51)

where

W (y, u) = E[W |YU = y, U = u] =fu,ε(y)

gu,ε(y)(B.52)

and

W (y) = E[W |YU = y] =fε(y)

gε(y). (B.53)

150

Page 162: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Therefore

AM(ε) =2∑

u=1

αu E[(W (YU , u)− W (YU))2|U = u] (B.54)

=2∑

u=1

αu E

[(fu,εgu,ε− fεgε

)2

(YU)∣∣∣U = u

](B.55)

= α1α22 E

[(f1,εg2,ε − f2,εg1,ε

g1,εgε

)2

(Y1)

]+

α2α21 E

[(f1,εg2,ε − f2,εg1,ε

g2,εgε

)2

(Y2)

](B.56)

=α1α2

ε

∫(f1,εg2,ε − f2,εg1,ε)

2

gεg1,εg2,ε

dy (B.57)

= α1α2 E[g1,ε(YU)g2,ε(YU)

g2ε (YU)

(W (YU , 1)− W (YU , 2))2

](B.58)

where

• (B.55): by (B.52) and (B.53).• (B.56): by (B.43) and (B.44).• (B.57): by (B.45) and (B.46).

Next we show that as ε→ 0, the quantity defined in (B.49) vanishes:

AM(ε) = o (1). (B.59)

Indeed,

AM(ε) ≤ 4M2α1α2E[g1,εg2,ε

g2ε

YU]

(B.60)

≤ 4M2α1α2

(E[g2,ε

gε Y1

]+ E

[g1,ε

gε Y2

]), (B.61)

where

• (B.60): by (B.58) and|W (y, u)| ≤M (B.62)

• (B.61): by (B.43) and (B.44), we have for u = 1, 2,

gu,εgε≤ 1

αu. (B.63)

Write

E[g2,ε

gε Y1

]=

∫ϕ(z)dz

∫g2,ε

gε(x+ εz)µ1(dx). (B.64)

151

Page 163: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Fix z. Sinceg1,ε

g2,ε

(x+ εz) =E[ϕz(y−X1

ε

)]E[ϕz(y−X2

ε

)] , (B.65)

and µ1 ⊥ µ2, applying Lemma 45 yields

limε↓0

g2,ε

g1,ε

(x+ εz) = 0, (B.66)

for µ1-a.e. x. Thereforeg2,ε

gε(x+ εz) = o (1) (B.67)

for µ1-a.e. x. In view of (B.64) and (B.63), we have

E[g2,ε

gε Y1

]= o (1) (B.68)

by the dominated convergence theorem, and in entirely analogous fashion:

E[g1,ε

gε Y2

]= o (1). (B.69)

Substituting (B.68) and (B.69) into (B.61), we obtain (B.59).By Lemma 47,

lim supε↓0

[mmse(NG|YU)−mmse(NG|YU , U)]

≤ limε↓0

AM(ε) + 6 ‖NG‖2

∥∥NG1|NG|>M∥∥

2(B.70)

= 6∥∥NG1|NG|>M

∥∥2. (B.71)

By the arbitrariness of M , letting ε = 1√snr

yields

0 = lim supε↓0

[mmse(NG|YU)−mmse(NG|YU , U)] (B.72)

= lim supsnr↓0

snr (mmse(XU , snr)−mmse(XU , snr|U)), (B.73)

which gives the desired result (B.47).

B.3.3 Mixture of two absolutely continuous measures

Now we assume µ1 µ2. In view of the proof in Appendix B.3.2, it is sufficientto show (B.59). Denote the Radon-Nikodym derivative of µ1 with respect to µ2 by

h =dµ1

dµ2

, (B.74)

where h : R→ R+ satisfies∫hdµ2 = 1.

152

Page 164: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

From (B.57), we have

AM(ε) = α1α2E

[g1,ε

(f1,ε

g1,ε

− f2,ε

g2,ε

)2

Y2

](B.75)

= α1α2

∫ϕ(z)dz

∫F c

g1,ε

(f1,ε

g1,ε

− f2,ε

g2,ε

)2

(x+ εz)µ2(dx) (B.76)

+ α1α2

∫ϕ(z)dz

∫F

(f1,εg2,ε − f2,εg1,ε)2

gεg1,εg22,ε

(x+ εz)µ2(dx), (B.77)

whereF = x : h(x) > 0. (B.78)

If h(x) = 0, applying Lemma 45 we obtain

g1,ε

gε(x+ εz) = o (1) (B.79)

for µ2-a.e. x and every z. In view of (B.62), we conclude that the integrand in (B.76)is also o (1).

If h(x) > 0, then

|(f1,εg2,ε − f2,εg1,ε)(x+ εz)| (B.80)

=∣∣∣ ∫ y − x1

εϕz

(y − x1

ε

)1| y−x1ε |≤Mh(x1)µ2(dx1)

∫ϕz

(y − x2

ε

)µ2(dx2) (B.81)

−∫y − x1

εϕz

(y − x1

ε

)1| y−x1ε |≤Mµ2(dx1)

∫ϕz

(y − x2

ε

)h(x2)µ2(dx2)

∣∣∣(B.82)

≤ M

∫∫|h(x1)− h(x2)|ϕz

(y − x1

ε

)ϕz

(y − x2

ε

)µ2(dx1)µ2(dx2) (B.83)

≤ 2Mg2,ε(x+ εz)

∫|h(x1)− h(x)|ϕz

(y − x1

ε

)µ2(dx1) (B.84)

=o(g2

2,ε(x+ εz))

(B.85)

for µ2-a.e. x and every z, which follows from applying Lemma 46 to h ∈ L1(R, µ2).By Lemma 45, we have

limε↓0

g22,ε

g1,εgε(x) =

1

h(x)(α1 + α2h(x))(B.86)

Combining (B.85) and (B.86) yields that the integrand in (B.77) is o (1). Since theintegrand is bounded by 4M2

α1, (B.47) follows from applying the dominated convergence

theorem to (B.75).

153

Page 165: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

B.3.4 Finite Mixture

Dealing with the more general case where µ1 and µ2 are arbitrary probabilitymeasures, we perform the Hahn-Lebesgue decomposition [150, Theorem 1.6.3] on µ1

with respect to µ2, which yields

µ1 = β1νc + β2νs, (B.87)

where 0 ≤ β1 = 1 − β2 ≤ 1 and νc and νs are two probability measures such thatνc µ2 and νs ⊥ µ2. Consequently, νs ⊥ νc. Therefore

µ = α1µ1 + α2µ2 (B.88)

= β1(α1νc + α2µ2) + β2(α1νs + α2µ2) (B.89)

= β1µ∗ + β2ν

∗ (B.90)

where

µ∗ = α1νc + α2µ2, (B.91)

ν∗ = α1νs + α2µ2. (B.92)

Then

D(µ) = β1D(µ∗) + β2D(ν∗) (B.93)

= β1[α1D(νc) + α2D(µ2)] + β2[α1D(νs) + α2D(µ2)] (B.94)

= α1[β1D(νc) + β2D(νs)] + α2D(µ2) (B.95)

= α1D(β1νc + β2νs) + α2D(µ2) (B.96)

= α1D(µ1) + α2D(µ2), (B.97)

where

• (B.93): applying the results in Appendix B.3.3 to (B.90), since by assumptionα2 > 0, we have µ∗ ν∗.

• (B.94), (B.96): applying the results in Appendix B.3.2 to (B.91), (B.92) and(B.87), since νc µ2, νs ⊥ µ2 and νs ⊥ νc.

Similarly, D(µ) = α1D(µ1)+α2D(µ2). This completes the proof of (B.21) and (B.22)for n = 2.

154

Page 166: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Next we proceed by induction on n: Suppose that (B.21) holds for n = N . Forn = N + 1, assume that αN+1 < 1, then

D

(N+1∑i=1

αiµi

)= D

((1− αN+1)

N∑i=1

αi1− αN+1

µi + αN+1µN+1

)(B.98)

= (1− αN+1)D

(N∑i=1

αi1− αN+1

µi

)+ αN+1D(µN+1) (B.99)

=N+1∑i=1

αiD(µi), (B.100)

where (B.99) and (B.100) follow from the induction hypothesis. Therefore, (B.21)and (B.22) hold for any n ∈ N.

B.3.5 Countable Mixture

Now we consider n =∞: without loss of generality, assume that∑N

i=1 αi < 1 forall N ∈ N. Then

D(µ) =N∑i=1

αiD(µi) + (1−N∑i=1

αi)D(νN) (B.101)

≤N∑i=1

αiD(µi) + (1−N∑i=1

αi), (B.102)

where

• (B.101): we have denoted νN =∑∞i=N+1 αiµi

1−∑Ni=1 αi

.

• (B.102): by Theorem 18.

Sending N → ∞ yields D(µ) ≤∑∞

i=1 αiD(µi), and in entirely analogous fashion,D(µ) ≥

∑∞i=1 αiD(µi). This completes the proof of Theorem 21.

Remark 33. Theorem 21 also generalizes to non-Gaussian noise. From the aboveproof, we see that (B.47) holds for all noise densities that satisfy Doob’s relative limittheorems, in particular, those meeting the conditions in [151, Theorem 4.1], e.g.,uniform (by (B.28)) and exponential and Cauchy density ([151, Theorems 5.1]).

More generally, notice that Lemma 44 deals with convolutional kernels whichcorrespond to additive-noise channels. In [151, Theorem 3.1], Doob also gave a resultfor general kernels. Therefore, it is possible to extend the results in Theorem 21 togeneral channels .

155

Page 167: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

B.4 Proof of Theorem 22

Proof. Let pi = P X = xi, ε = 1√snr

and Yε = X + εN . In view of (4.45), it isequivalent to show that

mmse(N |Yε) = o (1). (B.103)

Fix δ > 0. Since N ∈ L2, there exists K > 0, such that

E[N21|N |>K

]< δ/2. (B.104)

Since fN(N) <∞ a.s., we can choose J > 0 such that

P fN(N) > J < δ

2K2. (B.105)

DefineEδ = z : fN(z) ≤ J, |z| ≤ K. (B.106)

and Nδ = N1Eδ(N). Then we have

E[(N −Nδ)

2]≤ K2P fN(N) > J+ E

[N21|N |>K

](B.107)

≤ δ, (B.108)

where (B.108) follows from (B.104) and (B.105).The optimal estimator for Nδ based on Yε is given by

Nδ(y) =

∑j pj

y−xjε

1Eδ(y−xj

ε

)fN(y−xj

ε

)∑j pjfN

(y−xjε

) (B.109)

Then the MMSE of estimating Nδ based on Yε is

mmse(Nδ|Yε) =∑i

piE[(Nδ − Nδ(xi +N))2

](B.110)

=∑i

pi

∫Eδ

(z − Nδ(xi + z))2fN(z)dz

+∑i

pi

∫Ecδ

Nδ(xi + z)2fN(z)dz (B.111)

≤∑i

pi

∫Eδ

gz,i(ε)fN(z)dz +K2P N /∈ Eδ (B.112)

≤ J∑i

pi

∫ K

−Kgz,i(ε)dz + δ (B.113)

where

• (B.112): by |Nδ(y)| ≤ K, since Nδ ≤ K a.s. We have also defined

gz,i(ε) = (z − Nδ(xi + z))2. (B.114)

156

Page 168: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

• (B.113): by (B.106) and

K2P N /∈ Eδ≤ K2P fN(N) > J+K2P |N | > K (B.115)

≤ K2P fN(N) > J+ E[N21|N |>K

](B.116)

≤ δ, (B.117)

where (B.117) follows from (B.104) and (B.105).

Next we show that for all i and z ∈ Eδ, as ε→ 0, we have

gz,i(ε) = o (1). (B.118)

Indeed, using (B.109),

gz,i(ε) =

[∑j pj

xi−xjε

1Eδ(xi−xj

ε+ z)fN(xi−xj

ε+ z)∑

j pjfN(xi−xj

ε+ z) ]2

(B.119)

[∑j 6=i pj

xi−xjε

1Eδ(xi−xj

ε+ z)fN(xi−xj

ε+ z)]2

p2i fN(z)2

(B.120)

≤ J2(K + |z|)2

p2i fN(z)2

[∑j 6=i

pj1Eδ

(xi − xj

ε+ z

)]2

(B.121)

≤[J(K + |z|)pifN(z)

PX 6= xi,

∣∣∣∣xi −Xε + z

∣∣∣∣ ≤ K

]2

(B.122)

= o (1), (B.123)

where

• (B.121): by (B.106).• (B.123): by the boundedness of Eδ and the dominated convergence theorem.

By definition in (B.114), we have

gz,i(ε) ≤ (z +K)2. (B.124)

In view of (B.113) and dominated convergence theorem, we have

lim supε→0

mmse(Nδ|Yε) ≤ δ (B.125)

Then

lim supε→0

√mmse(N |Yε) ≤ lim sup

ε→0

∥∥∥N − Nδ(Yε)∥∥∥

2(B.126)

≤ lim supε→0

√mmse(Nδ|Yε) + ‖N −Nδ‖2 (B.127)

≤ 2√δ (B.128)

157

Page 169: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

where

• (B.126): by the suboptimality of Nδ.• (B.126): by the triangle inequality.• (B.128): by (B.108) and (B.125).

By the arbitrariness of δ, the proof of (B.103) is completed.

B.5 Proof of Theorems 23 – 25

We first compute the optimal MSE estimator under absolutely continuous noiseN . Let ε = 1√

snr. The density of Yε = X + εN is

q0(y) =1

εE[fN

(y −Xε

)]. (B.129)

Denote

q1(y) =1

εE[XfN

(y −Xε

)]. (B.130)

Then the MMSE estimator of X given Yε is given by

X(y) = E[X|Yε = y] =q1(y)

q0(y)=

E[XfN

(y−Xε

)]E[fN(y−Xε

)] . (B.131)

B.5.1 Proof of Theorem 24

Proof. By (4.44), we have

snr ·mmse(X,N, snr) = mmse(N |Yε), (B.132)

Due to Theorem 18, we only need to show

D(X,N) ≥ 1, (B.133)

which, in view of (B.132), is equivalent to

lim infε↓0

mmse(N |Yε) ≥ varN. (B.134)

The optimal estimator for N given Yε is given by

E[N |Yε = y] =E[NfX(y − εN)]

E[fX(y − εN)](B.135)

Fix an arbitrary positive δ. Since N ∈ L2(Ω), there exists M > 0, such that

E[N21|N |>M

]< δ2. (B.136)

158

Page 170: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Then √mmse(N |Yε)

=√

E[(N − E[N |Yε])2] (B.137)

≥√

E[(N − E[N1|N |≤M|Yε])2]−√

E[(E[N1|N |>M|Yε])2] (B.138)

≥√

E[(N − E[N1|N |≤M|Yε])2]−√

E[N21|N |>M] (B.139)

≥√EM,ε − δ (B.140)

where

• (B.138): by writing E[N |Yε] = E[N1|N |≤M|Yε] + E[N1|N |>M|Yε] and thetriangle inequality.

• (B.139): by E[(E[U |V ])2] ≤ E[U2] for all U, V ∈ L2(Ω).• (B.140): by (B.136) and

EM,ε , E[(N − E[N1|N |≤M|Yε])2]. (B.141)

Define

pM(y; ε) = E[N1|N |≤MfX(y − εN)] (B.142)

q(y; ε) = E[fX(y − εN)] (B.143)

NM(y; ε) = E[N1|N |≤M|Yε = y] =pM(y; ε)

q(y; ε). (B.144)

Suppose f ∈ Cb. Then by the bounded convergence theorem,

limε↓0

pM(x+ εz; ε) = E[N1|N |≤M]fX(x), (B.145)

limε↓0

q(x+ εz; ε) = fX(x) (B.146)

hold for all x, z ∈ R. Since fX(X) > 0 a.s.,

limε↓0

E[N1|N |≤M|Yε] = limε↓0

NM(X + εN ; ε) (B.147)

= E[N1|N |≤M] (B.148)

holds a.s. Then by Fatou’s lemma:

lim infε↓0

EM,ε ≥ E[N − E[N1|N |≤M]

]2 ≥ varN. (B.149)

159

Page 171: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

By (B.140),

lim infε↓0

√mmse(N |Yε) ≥ lim inf

ε↓0

√EM,ε − δ (B.150)

≥√varN − δ. (B.151)

By the arbitrariness of δ, we conclude that

lim infε↓0

mmse(N |Yε) ≥ varN, (B.152)

hence (B.133) holds.

B.5.2 Proof of Theorem 23

Now we are dealing withX whose density is not necessarily continuous or bounded.In order to show that (B.145) and (B.146) continue to hold under the assumptionsof the noise density in Theorem 23, we need the following lemma from [152, Section3.2]:

Lemma 48 ([152, Theorem 3.2.1]). Suppose the family of functions Kε : Rd → Rε>0

satisfies the following conditions: for some constant η > 0 and C ∈ R,∫RdKε(x)dx = C (B.153)

supx∈Rd,ε>0

ε|Kε(x)| < ∞ (B.154)

supx∈Rd,ε>0

|x|1+η

εη|Kε(x)| < ∞ (B.155)

hold for all ε > 0 and x ∈ Rd. Then for all f ∈ L1loc(Rd),

limε↓0

f ∗Kε(x) = Cf(x) (B.156)

holds for Lebesgue-a.e. x.

Note that in the original version of Lemma 48 in [152, Section 3.2] C = 1, and Kε

is dubbed approximation of the identity. For C 6= 0 or 1, the same conclusion followsfrom scaling. The case of C = 0 can be shown as follows: take some kernel Gε whichis an approximation to the identity. Then G + K is also an approximation to theidentity. Then the conclusion for K follows by applying Lemma 48 to both G andG+K and then subtracting the corresponding (B.156) limits.

Proof of Theorem 23. Based on the proof of Theorem 24, it is sufficient to show that(B.145) and (B.146) hold for Lebesgue-a.e. x and z. Fix z. First look at (B.146):introduce the following kernel which corresponds to the density of ε(N − z)

Kε(x) =1

εfN

(xε

+ z). (B.157)

160

Page 172: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

We check that Kε is an approximation to the identity by verifying:

• (B.153):∫RKε(x)dx =

∫R fN(u)du = 1.

• (B.154): supx∈R,ε>0 ε|Kε(x)| = supu∈R fN(u) <∞, since fN is bounded.• (B.155): by boundedness of fN and (4.69), we have: for some η > 0,

supu∈R|u|1+ηfN(u) < ∞ (B.158)

then(sup

x∈R,ε>0

|x|1+η

εη|Kε(x)|

) 11+η

= supu∈R|u|f

11+η

N (u+ z) (B.159)

≤ supu∈R|u|f

11+η

N (u) + |z| supu∈R

f1

1+η

N (u) (B.160)

< ∞. (B.161)

Note thatq(x+ εz; ε) = fX ∗Kε(x). (B.162)

Since fX ∈ L1(R), by Lemma 48, (B.146) holds for Lebesgue-a.e. x. Similarly, wedefine

Gε(x) =(xε

+ z)

1|xε+z|≤M1

εfN

(xε

+ z). (B.163)

Then it is verifiable that Gε satisfies (B.153) – (B.155), with∫Gε(x)dx = E[N1|N |≤M]. (B.164)

SincepM(x+ εz; ε) = fX ∗Gε(x), (B.165)

Lemma 48 implies that (B.145) holds for Lebesgue-a.e. x.

B.5.3 Proof of Theorem 25

Proof. Without loss of generality, we shall assume that EN = 0. Following thenotation in the proof of Theorem 24, similar to q(y; ε) in (B.143), we define

p(y; ε) = E[NfX(y − εN)]. (B.166)

Then

E[N |Yε = y] =p(y; ε)

q(y; ε). (B.167)

161

Page 173: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Fix x, z ∈ R. Since fX , f′X and f ′′X are all bounded, we can interchange derivatives

with integrals [101, Theorem 2.27] and write

q(x+ εz; ε)

= E[fX(x+ ε(z −N))] (B.168)

= fX(x) + εf ′X(x)E(z −N) +ε2

2f ′′X(x)E(z −N)2 + o

(ε2)

(B.169)

= fX(x) + εf ′X(x)z +ε2

2f ′′X(x)(z2 + varN) + o(ε2). (B.170)

Similarly, since E[|N |3] <∞, we have

p(x+ εz; ε)

= E[NfX(x+ ε(z −N))] (B.171)

= − εf ′X(x) varN +ε2

2f ′′X(x)(E[N3]− 2z varN) + o

(ε2). (B.172)

Define the score function ρ(x) =f ′XfX

(x). Since fX(X) > 0 a.s., by (B.167),

E[N |Yε] =ε2

2

[f ′′X(X)

fX(X)(E[N3]− 2NvarN) + 2Nρ2(X)varN

]− ερ(X)varN + o

(ε2). (B.173)

holds a.s. Then

mmse(N |Yε) = E[N − E[N |Yε]]2 (B.174)

= varN − εE [N ]E[ρ(X)]varN + ε2(varN)2E[ρ2(X)]− 2ε2(varN)2J(X)

+ ε2(E[N ]E[N3]− 2var2N)E[f ′′X(X)

fX(X)

]+ o

(ε2)

(B.175)

= varN − ε2(varN)2J(X) + o(ε2), (B.176)

where

• (B.175): by the bounded convergence theorem, because the ratio between the

o (ε2) term in (B.173) and ε2 is upper bounded by ρ2(X) andf ′′X(X)

fX(X), which are

integrable by assumption.

• (B.176): by E [ρ(X)] = 0, E [ρ2(X)] = J(X) and E[f ′′X(X)

fX(X)

]= 0, in view of

(4.84).

162

Page 174: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

B.6 Proof of Theorem 28

By Remark 7, Theorem 28 holds trivially if P X = 0 = 1 or P X = 1 = 1.Otherwise, since X = 0 if and only if (X)j = 0 for all j ∈ N, but (X)j are i.i.d.,therefore P X = 0 = 0. Similarly, P X = 1 = 0. Hence 0 < X < 1 a.s.

To prove (4.105), we define the function G : R→ R+

G(b) = M2b ·mmse(X,N,M2b) (B.177)

= mmse(N,X,M−2b), (B.178)

where (B.178) follows from (4.44). The oscillatory behavior of G is given in thefollowing lemma:

Lemma 49.

1. For any b ∈ R, G(k + b) : k ∈ N is an nondecreasing nonnegative sequencebounded from above by varN .

2. Define a function Ψ : R→ R+ by

Ψ(b) = limk→∞

G(k + b). (B.179)

Then Ψ is a 1-periodic function, and the convergence in (B.179) is uniform inb ∈ [0, 1]. Therefore as b→∞,

G(b) = Ψ(b) + o (1) (B.180)

In view of Lemma 49, we define a 2 logM -periodic function ΦX,N : R → [0, 1] asfollows:

ΦX,N(t) =1

varNΨ

(t

2 logM

). (B.181)

Having defined ΦX,N , (4.105), (4.106) and (4.107) readily follow from (B.180).Next we prove (4.108) in the case of GaussianN : in view of (B.181), it is equivalent

to show ∫ 1

0

Ψ(b)db = d(X). (B.182)

Denotesnr = M2b. (B.183)

Recalling G(b) defined in (B.178), we have∫ b

0

G(τ)dτ =

∫ snr

1

γmmse(X, γ)1

2γ logMdγ (B.184)

=I(snr)− I(1)

2 logM, (B.185)

163

Page 175: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

where (B.185) follows from (4.49) and I(snr) is defined in (4.50).Since d(X) exists, in view of (2.66) and (2.67), we have

limb→∞

1

b

∫ b

0

G(τ)dτ = limsnr→∞

2I(snr)

log snr= d(X). (B.186)

Since the convergence in (B.179) is uniform in b ∈ [0, 1], for all ε > 0, there exists k0

such that for all k ≥ k0 and b ∈ [0, 1],

|G(k + b)− ΦX,NG(b)| ≤ ε. (B.187)

Then for all integers k ≥ k0, we have∣∣∣∣1k∫ k

0

G(τ)dτ −∫ 1

0

Ψ(τ)dτ

∣∣∣∣ ≤ 2k0

k+

1

k

k−1∑j=k0

∫ 1

0

|G(j + τ)−Ψ(τ)|dτ (B.188)

≤ 2k0

k+k − k0

kε (B.189)

where

• (B.188): G and Ψ map into [0, 1].• (B.189): by (B.187).

By the arbitrariness of ε and (B.186), we have∫ 1

0

Ψ(τ)dτ = limk→∞

1

k

∫ k

0

G(τ)dτ = d(X). (B.190)

To finish the proof, we prove Lemma 49. Note that

G(k + 1 + b) = mmse(N,X,M−2(k+1+b)) (B.191)

= mmse(N |N +√snrMk+1X) (B.192)

= mmse(N |N +√snrMk(U + V )) (B.193)

≥ mmse(N |N +√snrMkV ) (B.194)

= mmse(N |N +√snrMkX) (B.195)

= G(k + b) (B.196)

where

• (B.191): by (B.178);• (B.192): by (4.43) and (B.183);• (B.193): by the M -ary expansion of X in (2.47) and we have defined

U = (X)1, (B.197)

V =∞∑j=1

(X)j+1M−j; (B.198)

164

Page 176: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

• (B.194): by the data-processing inequality of MMSE [87], since U is indepen-dent of V ;

• (B.195): by VD=X since (X)j is an i.i.d. sequence.

Therefore for fixed b, G(k + b) is an nondecreasing sequence in k. By (B.191),

G(k + b) ≤ varN, (B.199)

hence limk→∞G(k+b) exists, denoted by Ψ(b). The 1-periodicity of Ψ readily follows.To prove (B.180), we show that there exist two functions c1, c2 : R→ R+ depend-

ing on the distribution of X and N only, such that

limb→∞

ci(b) = 0, i = 1, 2 (B.200)

and

0 ≤ Ψ(b)−G(b) (B.201)

≤ c21(b) + c2(b) + 2c1(b)

√varN + c2(b). (B.202)

Then (B.180) follows by combining (B.200) – (B.202) and sending b→∞. Inequalities(B.201) and (B.202) also show that the convergence in (B.179) is uniform in b ∈ [0, 1].

To conclude this proof, we proceed to construct the desired functions c1 and c2

and prove (B.201) and (B.202): by monotonicity, for all b ∈ R,

G(b) ≤ Ψ(b). (B.203)

Hence (B.201) follows. To prove (B.202), fix k ∈ N and K > 0 to be specified later.Define

NK = N1|N |≤K (B.204)

NK = N −NK (B.205)

We use a suboptimal estimator to bound G(k + b) − G(b). To streamline the proof,introduce the following notation:

W = Mk[X]k (B.206)

Z = Mk(X − [X]k) (B.207)

Y =√snrX +N (B.208)

Y ′ =√snrMkX +N (B.209)

=√snrW +

√snrZ +N, (B.210)

where W is integer-valued. Note that since X has i.i.d. M -ary expansion, N,W,Zare independent, and

ZD=X, (B.211)

165

Page 177: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

henceY ′

D= Y +

√snrW. (B.212)

Based on Y ′, we use the following two-stage suboptimal estimator NK for NK :first estimate W (the first k bits of X) based on Y ′ according to

w(y) =

[y√snr

]. (B.213)

Then peel off W = w(Y ′) from Y ′ and plug it into the optimal estimator for X basedon Y

NK(y) = E[NK |Y = y] (B.214)

to estimate X, i.e.,NK(y) = NK(y −

√snrw(y)). (B.215)

Next we bound the probability of choosing the wrong W :

PW 6= W

= 1− P

[N√snr

+W + Z

]= W

(B.216)

= 1− PW ≤ N√

snr+W + Z < W + 1

(B.217)

= 1− P−√snrX ≤ N <

√snr(1−X)

(B.218)

=

∫(0,1)

PX(dx)[1− P−√snrx ≤ N <

√snr(1− x)

] (B.219)

, κ(snr)→ 0, (B.220)

where

• (B.216): by (B.210) and (B.213);• (B.217): by the fact that [x] = n if and only if n ≤ x < n+ 1;• (B.218): by (B.211);• (B.219): by the assumption that X ∈ (0, 1) a.s.;• (B.220): by the bounded convergence theorem.

Note that κ(snr) is a nonincreasing nonnegative function depending only on the dis-tribution of X and N .

Finally we choose K and analyze the performance of NK . Observe from (B.216)that the probability of choosing the wrong W does not depend k. This allows us tochoose K independent of k:

K = [κ(snr)]−13 . (B.221)

Therefore, √G(k + b) =

√mmse(N |Y ′) (B.222)

≤∥∥N −NK +NK − NK(Y ′)

∥∥2

(B.223)

≤∥∥NK

∥∥2

+∥∥NK − NK(Y ′ −

√snrW )

∥∥2

(B.224)

166

Page 178: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

where

• (B.222): by (B.178) and (B.210);• (B.223): by the suboptimality of NK ;• (B.224): by (B.215).

Now

E[(NK − NK(Y ′ −

√snrW ))2

]≤ E[NK − NK(Y )]2+

E[(NK − NK(Y ′ −

√snrW ))21W 6=W

](B.225)

≤ mmse(NK |Y ) + 4K2PW 6= W

(B.226)

≤ G(b) + 3∥∥NK

∥∥2‖N‖2 + 4κ(snr)

13 (B.227)

where

• (B.225): by (B.212);• (B.226): by |NK(y)| = |E[NK |Y = y]| ≤ K for all y, since |NK | ≤ K a.s.;• (B.227): by Lemma 47, (B.220), (B.221) and mmse(N |Y ) = G(b).

Define

c1(b) =∥∥NK

∥∥2

(B.228)

c2(b) = 3∥∥NK

∥∥2‖N‖2 + 4κ(snr)

13 (B.229)

where b and K are related to snr through (B.183) and (B.221), respectively. Thensubstituting (B.227) into (B.224) yields√

G(k + b) ≤ c1(b) +√G(b) + c2(b). (B.230)

Note that the right hand side of (B.230) does not depend on k. By (B.179), sendingk →∞ we obtain √

Ψ(b) ≤ c1(b) +√G(b) + c2(b). (B.231)

In view of (B.203), squaring1 (B.231) on both sides and noticing that G(b) ≤ varNfor all b, we have

Ψ(b)−G(b) ≤ c21(b) + c2(b) + 2c1(b)

√varN + c2(b). (B.232)

By (B.220) and (B.221), K →∞ as b→∞. Since E [N2] <∞, c1(b) and c2(b) bothtend to zero as b→∞, and (B.202) follows.

1In this step it is essential that G(b) be bounded, because in general√an =

√bn + o (1) does not

imply that an = bn + o (1). For instance, an = n+ 1, bn = n.

167

Page 179: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

B.7 Proof for Remark 8

We show that for any singular X and any binary-valued N ,

D(X,N) = 0. (B.233)

To this end, we need the following auxiliary results:

Lemma 50 (Mutually singular hypothesis). The optimal test for the binary hypoth-esis testing problem

H0 : P

H1 : Q(B.234)

with prior P H0 ∈ (0, 1) has zero error probability if and only if P ⊥ Q.

Proof. By definition, P ⊥ Q if and only if there exists an event A such that P (A) = 1and Q(A) = 0. Then the test that decides H0 if and only if A occurs yields zero errorprobability.

Lemma 51 ([153, Theorem 10]). Let µ be a probability measure on (R,B) that ismutually singular with respect to Lebesgue measure. Then there exists a Borel set Ewith µ(E) = 1 and a non-empty perfect set C such that c + E : c ∈ C is a familyof disjoint sets.

Proof of (B.233). In view of Theorem 19, we may assume that N is 0, 1-valuedwithout loss of generality. For any input X whose distribution µ is mutually singularto the Lebesgue measure, we show that there exists a vanishing sequence εn, suchthat for all n,

mmse(X|X + εnN) = 0. (B.235)

which implies that D(X,N) = 0.By Lemma 51, there exists a Borel set E and a perfect set C, such that µ(E) = 1

and c + E : c ∈ C is a family of disjoint sets. Pick any a ∈ C. Since C is perfect,there exists an ⊂ C, such that an → a and εn , an − a > 0. Since X and X + εnare supported on disjoint subsets E and E + εn respectively, their distributions aremutually singular. By Lemma 50, the optimal test for N based on X + εnN succeedswith probability one, which implies (B.235).

168

Page 180: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Appendix C

Proofs in Chapter 5

C.1 Injectivity of the cosine matrix

We show that the cosine matrix defined in Remark 17 is injective on Σk. Weconsider a more general case. Let l = 2k ≤ n and ω1, . . . , ωn ⊂ R. Let H be anl × n matrix where Hij = cos((i− 1)ωj). We show that each l × l submatrix formedby columns of H is non-singular if and only if cos(ω1), . . . , cos(ωn) are distinct.

Let G be the submatrix consisting of the first l columns of H. Then Gij = Ti−1(xj)where xj = cos(ωj) and Tm denotes the mth order Chebyshev polynomial of the firstkind [154]. Note that det(G) is a polynomial in (x1, . . . , xl) of degree l(l− 1)/2. Alsodet(G) = 0 if xi = xj for some i 6= j. Therefore det(G) = C

∏1≤i<j≤l(xi − xj). The

constant C is given by the coefficient of the highest order term in the contributionfrom the main diagonal

∏li=1 Ti−1(xi). Since the leading coefficient of Tj is 2j−1, we

have C = 2(l−1)(l−2)/2. Therefore,

det(G) = 2(l−1)(l−2)/2∏

1≤i<j≤l[cos(ωi)− cos(ωj)] 6= 0. (C.1)

C.2 Proof of Lemma 9

Proof. Since ‖·‖p-norms are equivalent, it is sufficient to only consider p = ∞. Ob-

serve that ε 7→ NA(ε) is non-increasing. Hence for any 2−m ≤ ε < 2−(m−1), we have

logNA(2−(m−1))

m≤ logNA(ε)

log 1ε

≤ logNA(2−m)

m− 1. (C.2)

Therefore it is sufficient to restrict to ε = 2−m and m → ∞ in (3.2) and (3.3). Tosee the equivalence of covering by mesh cubes, first note that NA(2−m) ≥ NA(2−m).On the other hand, any `∞-ball of radius 2−m is contained in the union of 3n meshcubes of size 2−m (by choosing a cube containing some point in the set together withits neighboring cubes). Thus NA(2−m) ≤ 3nNA(2−m). Hence the limits in (3.2) and(3.3) coincide with those in (3.12) and (3.13).

169

Page 181: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

C.3 Proof of Lemma 11

Proof. By Pinsker’s inequality,

D(P ||Q) ≥ 1

2d2(P,Q) log e, (C.3)

where d(P,Q) is the total variation distance between P and Q and 0 ≤ d(P,Q) ≤ 2.In this case where X is countable,

d(P,Q) =∑x∈X|P (x)−Q(x)|. (C.4)

By [55, Lemma 2.7, p. 33], when d(P,Q) ≤ 12,

|H(P )−H(Q)| ≤ d(P,Q) log|X |

d(P,Q)(C.5)

≤ d(P,Q) log |X |+ e−1 log e. (C.6)

When d(P,Q) ≥ 12, by (C.3), D(P ||Q) ≥ 1

8log e; when 0 ≤ d(P,Q) < 1/2, by (C.6),

d(P,Q) ≥ (|H(P )−H(Q)| − e−1 log e)+

log |X |. (C.7)

Using (C.3) again,

D(P ||Q) ≥ 1

2

[(|H(P )−H(Q)| − e−1 log e)+

log |X |

]2

log e. (C.8)

Since |H(P ) − H(Q)| ≥ δ holds in the minimization of (3.26) and (3.28), (3.30) isproved.

C.4 Proof of Lemma 19

Lemma 52. Let A ⊂ Rk be compact, and let g : A → Rn be continuous. Then,there exists a Borel measurable function f : g(A) → A such that g(f(x)) = x for allx ∈ g(A).

Proof. For all x ∈ g(A), g−1(x) is nonempty and compact since g is continuous.For each i ∈ 1, . . . , k, let the ith component of f be

fi(x) = minti : tk ∈ g−1(x), (C.9)

where ti is the ith coordinate of tk. This defines f : g(A) → A which satisfiesg(f(x)) = x for all x ∈ g(A). Now we claim that each fi is lower semicontinuous,which implies that f is Borel measurable. To this end, we show that for any a ∈ A,f−1((a,∞)) is open. Assume the opposite, then there exists a sequence ym in

170

Page 182: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

g(A) that converges to y ∈ g(A), such that f(ym) ≤ a and f(y) > a. Due to thecompactness of A, there exists a subsequence f(yml) that converges to some point xin A. Therefore x ≤ a. But by the continuity of g, we have

g(x) = liml→∞

g(f(yml)) = liml→∞

yml = y.

Hence by definition of f and f(y) > a, we have x > a, which is a contradiction.Therefore fi is lower semicontinuous.

Proof of Lemma 19. Let Sn ⊂ Rn be bRnc-rectifiable and assume that (5.99) holdsfor all n ≥ N . Then by definition there exists a bounded subset T n ⊂ RbRnc and aLipschitz function gn : T n → Rn, such that Sn = gn(T n). By continuity, gn can beextended to the closure T n, and Sn = gn(T n). Since T n is compact, by Lemma 52,there exists a Borel function fn : Rn → RbRnc, such that gn(fn(xn)) = xn for allxn ∈ Sn. By Kirszbraun’s theorem [104, 2.10.43], gn can be extended to a Lipschitzfunction on the whole RbRnc. Then

P gn(fn(Xn)) = Xn ≥ PXn ∈ Sn

≥ 1− ε (C.10)

for all n ≥ N , which proves the ε-achievability of R.

C.5 Proof of Lemma 21

Proof. Suppose we can construct a mapping τ : [0, 1]k → [0, 1]k+1 such that∥∥τ(xk)− τ(yk)∥∥∞ ≥M−kd(xk, yk) (C.11)

holds for all xk, yk ∈ [0, 1]k. By (C.11), τ is injective. Let W ′ = τ(W ) and g′ = gτ−1.Then by (C.11) and the L-Lipschitz continuity of g,∥∥g′(xk)− g′(yk)∥∥∞ =

∥∥g(τ−1(xk))− g(τ−1(yk))∥∥∞ (C.12)

≤ L d(τ−1(xk), τ−1(yk)) (C.13)

≤ LMk∥∥τ−1(xk)− τ−1(yk)

∥∥∞ (C.14)

holds for all xk, yk ∈ W ′. Hence g′ : W ′ → Rn is Lipschitz with respect to the `∞distance, and it satisfies g′(W ′) = g(W ).

To complete the proof of the lemma, we proceed to construct the required τ . Theessential idea is to puncture the M -ary expansion of xk such that any component hasat most k consecutive nonzero digits. For notational convenience, define r : N →0, . . . , k and ηj : ZkM → Zk+1

M for j = 0, . . . , k as follows:

r(i) = i−⌊

i

k + 1

⌋(k + 1), (C.15)

ηj(bk) = (b1, . . . , bj, 0, bj+1, . . . , bk)

T. (C.16)

171

Page 183: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Defineτ(xk) =

∑i∈N

ηr(i)((xk)i)M

−i. (C.17)

A schematic illustration of τ in terms of the M -ary expansion is given in Fig. C.1.

xk =

(x1)1 (x2)1 . . ....

... · · ·(xk)1 (xk)1 . . .

τ−→

τ(xk) =

0 (x1)2 . . . (x1)k 0 . . .(x1)1 0 . . . (x2)k (x1)k+1 . . .(x2)1 (x2)2 . . . (x3)k (x2)k+1 . . .

...... . . .

...... . . .

(xk−1)1 (xk−1)2 . . . (xk)k (xk−1)k+1 . . .(xk)1 (xk)2 . . . 0 (xk)k+1 . . .

Figure C.1: A schematic illustration of τ in terms of M -ary expansions.

Next we show that τ satisfies the expansiveness condition in (C.11). For anyxk 6= yk ∈ [0, 1]k, let d(xk, yk) = M−l. Then by definition, for some m ∈ 1, . . . , k,d(xm, ym) = M−l and l = mini ∈ N : (xm)i 6= (ym)i. Without loss of generality,assume that (xm)i = 1 and (ym)i = 0. Then by construction of τ , there are no morethan k consecutive nonzero digits in τ(x) or τ(y). Since the worst case is that (xm)iand (ym)i are followed by k 0’s and k (M − 1)’s respectively, we have∥∥τ(xk)− τ(xk)

∥∥∞ ≥ M−(k+l+1) (C.18)

= M−(1+k)d(xk, yk) (C.19)

which completes the proof of (C.11).

C.6 Proof of Lemma 26

Proof. Recall that l = k + j where

j = dδne. (C.20)

Let us write q : [S]m → [0, 1]l as

q =

[pτ

], (C.21)

where p : [S]m → [0, 1]k and τ : [S]m → [0, 1]j. The idea to construct q that satisfies(5.197) is the following: for those sufficiently separated x and y in [S]m, their imagesunder p will be kept separated accordingly. For those x and y that are close together,

172

Page 184: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

they can be resolved by the mapping τ . Suppose p and τ are constructed so that

‖p(x)− p(y)‖∞ ≥ 2−m (C.22)

holds for all x, y such that ‖x− y‖∞ > 2−m and

‖τ(x)− τ(y)‖∞ ≥ 2−1/δ (C.23)

holds for all x, y such that ‖x− y‖∞ = 2−m. Then by (C.21),

‖q(x)‖∞ = max ‖p(x)‖∞ , ‖τ(x)‖∞ , (C.24)

we arrive at a q that satisfies the required (5.190) for all distinct x and y in [S]m.First construct p. For x ∈ [S]m, define

D(x) = y ∈ S : [y]m = x 6= ∅. (C.25)

Then we could choose1 one vector in D(x) to correspond with x and denote it by x.Since g : T → S is bijective, let the inverse of g be f : S → T . By the 2−m-stabilityof g on T ,

‖a− b‖∞ ≤ 2−m ⇒ ‖g(a)− g(b)‖∞ ≤ 2−m (C.26)

holds for all a, b ∈ T . By contraposition,

‖u− v‖∞ > 2−m ⇒ ‖f(u)− f(v)‖∞ > 2−m (C.27)

holds for all u, v ∈ S.Define p(x) = f(x). Consider x 6= y ∈ [S]m. Note that ‖x− y‖∞ is an integer

multiple of 2−m and ‖x− y‖∞ ≥ 2−m. If ‖x− y‖∞ > 2−m, i.e., ‖x− y‖∞ ≥ 2−(m−1),we have |xi−yi| ≥ 2−(m−1) for some i ∈ 1, . . . , n. Without loss of generality, assumethat

yi ≥ xi + 2−(m−1) (C.28)

Also notice that by the choice of x and y,

xi ≤ xi < xi + 2−m, (C.29)

yi ≤ yi < yi + 2−m, (C.30)

Then

‖x− y‖∞ ≥|xi − yi| (C.31)

≥yi − yi + yi − xi + xi − xi (C.32)

>2−m. (C.33)

1The set D(x) is, in general, uncountable. Hence this construction of p is valid by adopting theAxiom of Choice.

173

Page 185: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

where the last inequality follows from (C.28) – (C.30). Hence by (C.27), (C.22) holdsfor all x, y such that ‖x− y‖∞ > 2−m.

We proceed to construct τ : [S]m → [0, 1]j by a graph-theoretical argument: con-sider a undirected graph (G,E) where the vertex set is G = [S]m, and the edge setE ⊂ [S]m × [S]m is formed according to

(x, y) ∈ E⇔ ‖x− y‖∞ = 2−m (C.34)

Note that for any x in G, the number of possible y’s such that (x, y) ∈ E is at most2n − 1, since each component of y and x differ by at most the least significant bit.Therefore the maximum degree of this graph is upper bounded by 2n − 1, i.e.

∆(G) ≤ 2n − 1. (C.35)

The mapping τ essentially assigns a different j-dimensional real vector to each of thevertices in G. We choose these vectors from the following “grid”:

LM =

1

Mzj : zj ∈ ZjM

⊂ [0, 1]j, (C.36)

that is, the grid obtained by dividing each dimension of [0, 1]j uniformly into Msegments, where M ∈ N. Hence |LM | = M l. Choose M =

⌈21/δ⌉, then by (C.20) and

(C.35),|LM | = ∆(G) + 1. (C.37)

We define τ by coloring the graph G using colors from LM , such that neighboringvertices have different colors. This is possible in view of Lemma 27 and (C.37).Denote by τ(x) the color of x ∈ G. Since the vectors in LM are separated by at leastM−1 in `∞ distance, by the construction of τ and G, (C.23) holds for all x, y such that‖x− y‖∞ = 2−m. This finishes the construction of τ and the proof of the lemma.

174

Page 186: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Appendix D

Distortion-rate tradeoff ofGaussian inputs

In this appendix we show the expressions (6.30) – (6.32) for the minimal distortion,thereby completing the proof of Theorem 46.

D.1 Optimal encoder

Plugging the rate-distortion function of standard Gaussian

RXG(D) =

1

2log+ 1

D(D.1)

into (6.6) yields the equality in (6.30).

D.2 Optimal linear encoder

We compute the minimal distortion D∗L(X,R, σ2) achievable with the optimallinear encoder. Let the sensing matrix by H. Since Xn and Y k = HXn + σNk

are jointly Gaussian, the conditional distribution of Xn given Y k is N (Xn,ΣXn|Y k),where

Xn = HT(HHT + σ2Ik)−1Y k (D.2)

ΣXn|Y k = In −HT(HHT + σ2Ik)−1H (D.3)

= (In + σ−2HTH)−1. (D.4)

where we used the matrix inversion lemma. Therefore, the optimal estimator is linear,given by (D.2). Moreover,

mmse(Xn|Y k) = Tr(ΣXn|Y k) (D.5)

= Tr((In + σ−2HTH)−1). (D.6)

175

Page 187: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Choosing the best encoding matrix H boils down to the following optimizationproblem:

min Tr((In + σ−2HTH)−1)

s.t. Tr(HTH) ≤ k

H ∈ Rk×n(D.7)

Let HTH = UTΛU, where U is an n × n orthogonal matrix and Λ is a diagonalmatrix consisting of the eigenvalues of HTH, denoted by λ1, . . . , λn ⊂ R+. Then

Tr((In + σ−2HTH)−1) =n∑i=1

1

1 + σ−2λi(D.8)

≥ n

1 + σ−2 Tr(HTH)n

(D.9)

≥ n

1 + knσ−2

(D.10)

=n

1 +Rσ−2. (D.11)

where

• (D.9): by the strict convexity of x 7→ 11+σ−2x

on R+ and Tr(HTH) =∑n

i=1 λi;• (D.10): due to the energy constraint.

Hence

D∗L(XG, R, σ2) ≥ 1

1 +Rσ−2. (D.12)

Next we consider two cases separately:

1. R ≥ 1(k ≥ n): the lower bound in (D.12) can be achieved by

H =

[√RIn0

]. (D.13)

2. R < 1(k < n): the lower bound in (D.12) is not achievable. This is be-cause to achieve equality in (D.9), all λi be equal to R; however, rank(HTH) ≤rank(H) ≤ k < n implies that at least n − k of them are zero. Therefore thelower bound (D.11) can be further improved to:

Tr((In + σ−2HTH)−1) = n− k +∑λi>0

1

1 + σ−2λi(D.14)

≥ n− k +k

1 + σ−2 Tr(HTH)k

(D.15)

≥ n− k

1 + σ2. (D.16)

176

Page 188: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Hence when R < 1,

D∗L(XG, R, σ2) ≥ 1− R

1 + σ2, (D.17)

which can be achieved byH =

[Ik 0

], (D.18)

that is, simply keeping the first k coordinates of Xn and discarding the rest.

Therefore the equality in (6.31) is proved.

D.3 Random linear encoder

We compute the distortion DL(X,R, σ2) achievable with random linear encoderA. Recall that A has i.i.d. entries with zero mean and variance 1

n. By (D.6),

1

nmmse(Xn|AXn + σNk,A) =

1

nE[Tr((In + σ−2ATA)−1)

](D.19)

=1

nE

[n∑i=1

1

1 + σ−2λi

], (D.20)

where λ1, . . . , λn are the eigenvalues of ATA.As n → ∞, the empirical distribution of the eigenvalues of 1

RATA converges

weakly to the Marcenko-Pastur law almost surely [34, Theorem 2.35]:

νR(dx) = (1−R)+δ0(dx) +

√(x− a)(x− b)

2πcx1[a,b](x)dx (D.21)

where

c =1

R, a = (1−

√c)2, b = (1 +

√c)2. (D.22)

Since λ 7→ 11+σ−2λ

is continuous and bounded, applying the dominated convergencetheorem to (D.20) and integrating with respect to νR gives

DL(XG, R, σ2) = lim

n→∞1

nmmse(Xn|AXn + σNk,A)

=

∫1

1 + σ−2RxνR(dx) (D.23)

=1

2

(1−R− σ2 +

√(1−R)2 + 2(1 +R)σ2 + σ4

), (D.24)

where (D.24) follows from [34, (1.16)].Next we verify that the formula in Claim 1 which was based on replica calculations

coincides with (D.24) in the Gaussian case. Since in this case mmse(XG, snr) = 11+snr

,

177

Page 189: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

(6.13) becomes

1

η= 1 +

1

σ2mmse(X, ηRσ−2) (D.25)

= 1 +1

σ2 + ηR(D.26)

whose unique positive solution is given by

ησ =R− 1− σ2 +

√(1−R)2 + 2(1 +R)σ2 + σ4

2R(D.27)

which lies in (0, 1). According to (6.12),

DL(XG, R, σ2) = mmse(XG, σ

−2ησ) (D.28)

=1

1 + σ−2ησ(D.29)

=2σ2

R− 1 + σ2 +√

(1−R)2 + 2(1 +R)σ2 + σ4, (D.30)

which can be verified, after straightforward algebra, to coincide with (D.24).

178

Page 190: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

Bibliography

[1] E. Candes, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact sig-nal reconstruction from highly incomplete frequency information,” IEEE Trans-actions on Information Theory, vol. 52, no. 2, pp. 489 – 509, Feb. 2006.

[2] D. L. Donoho, “Compressed sensing,” IEEE Transactions on Information The-ory, vol. 52, no. 4, pp. 1289 – 1306, Apr. 2006.

[3] D. L. Donoho and J. Tanner, “Counting faces of randomly-projected polytopeswhen the projection radically lowers dimension,” Journal of the American Math-ematical Society, vol. 22, no. 1, pp. 1–53, 2009.

[4] D. L. Donoho, A. Maleki, and A. Montanari, “Message-passing algorithms forcompressed sensing,” Proceedings of the National Academy of Sciences, vol. 106,no. 45, pp. 18 914 – 18 919, Nov. 2009.

[5] Y. Wu and S. Verdu, “Renyi information dimension: Fundamental limits of al-most lossless analog compression,” IEEE Transactions on Information Theory,vol. 56, no. 8, pp. 3721 – 3748, Aug. 2010.

[6] D. L. Donoho, A. Maleki, and A. Montanari, “The noise-sensitivityphase transition in compressed sensing,” Department of Statistics, StanfordUniversity, Tech. Rep. 2010-04, Apr. 2010. [Online]. Available: http://arxiv.org/abs/1004.1218

[7] D. Guo, D. Baron, and S. Shamai (Shitz), “A single-letter characterization ofoptimal noisy compressed sensing,” in Proceedings of the Forty-Seventh AnnualAllerton Conference on Communication, Control, and Computing, Monticello,IL, Oct. 2009.

[8] G. Reeves and M. Gastpar, “Sampling bounds for sparse support recovery in thepresence of noise,” in Proceedings of the 2008 IEEE International Symposiumon Information Theory, Toronto, Canada, Jul. 2008.

[9] G. Reeves and M. Gastpar, “Fundamental tradeoffs for sparsity patternrecovery,” 2010, preprint. [Online]. Available: http://arxiv.org/abs/1006.3128

[10] Y. Kabashima, T. Wadayama, and T. Tanaka, “Statistical mechanical analysisof a typical reconstruction limit of compressed sensing,” in Proceedings of 2010IEEE International Symposium on Information Theory, Austin, TX, Jun. 2010.

179

Page 191: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

[11] A. M. Tulino, G. Caire, S. Shamai (Shitz), and S. Verdu, “Support recoverywith sparsely sampled free random matrices,” in Proceedings of 2011 IEEE In-ternational Symposium on Information Theory, Saint Petersburg, Russia, Aug.2011.

[12] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basispursuit,” SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 33–61, 1999.

[13] R. Tibshirani, “Regression shrinkage and selection via the LASSO,” vol. 58,no. 1, pp. 267–288, 1996.

[14] S. G. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionar-ies,” IEEE Transactions on Signal Processing, vol. 41, no. 12, pp. 3397–3415,1993.

[15] D. L. Donoho and J. Tanner, “Counting the faces of randomly-projected hyper-cubes and orthants, with applications,” Discrete and Computational Geometry,vol. 43, no. 3, pp. 522–541, 2010.

[16] D. L. Donoho and J. Tanner, “Precise undersampling theorems,” Proceedingsof the IEEE, vol. 98, no. 6, pp. 913–924, 2010.

[17] A. Renyi, Probability Theory. Amsterdam: North-Holland Publishing Com-pany, 1970.

[18] K. Falconer, Fractal Geometry: Mathematical Foundations and Applications,2nd ed. New York: Wiley, 2003.

[19] R. Mane, “On the dimension of the compact invariant sets of certain non-linear maps,” in Dynamical Systems and Turbulence, Warwick 1980, ser. Lec-ture Notes in Mathematics. Springer, 1981, vol. 898, pp. 230 – 242.

[20] A. Ben-Artzi, A. Eden, C. Foias, and B. Nicolaenko, “Holder continuity for theinverse of Mane’s projection,” Journal of Mathematical Analysis and Applica-tions, vol. 178, pp. 22 – 29, 1993.

[21] B. R. Hunt and V. Y. Kaloshin, “Regularity of of embeddings of infinite-dimensional fractal sets into finite-dimensional spaces,” Nonlinearity, vol. 12,no. 5, pp. 1263–1275, 1999.

[22] T. Blumensath and M. E. Davies, “Sampling theorems for signals from theunion of finite-dimensional linear subspaces,” IEEE Transactions on Informa-tion Theory, vol. 55, no. 4, pp. 1872–1882, 2009.

[23] Y. Peres and B. Solomyak, “Existence of Lq dimensions and entropy dimensionfor self-conformal measures,” Indiana University Mathematics Journal, vol. 49,no. 4, pp. 1603–1621, 2000.

180

Page 192: Shannon Theory for Compressed Sensing - Yale Universityyw562/reprints/wu.thesis.pdf · 2012. 1. 28. · understanding of the theoretical limits of the compressed sensing paradigm.

[24] A. Renyi, “On the dimension and entropy of probability distributions,” ActaMathematica Hungarica, vol. 10, no. 1 – 2, Mar. 1959.

[25] T. Kawabata and A. Dembo, “The rate-distortion dimension of sets and mea-sures,” IEEE Transactions on Information Theory, vol. 40, no. 5, pp. 1564 –1572, Sep. 1994.

[26] A. Guionnet and D. Shlyakhtenko, “On classical analogues of free entropy di-mension,” Journal of Functional Analysis, vol. 251, no. 2, pp. 738 – 771, Oct.2007.

[27] Y. Wu and S. Verdu, “Fundamental Limits of Almost Lossless Analog Compres-sion,” in Proceedings of 2009 IEEE International Symposium on InformationTheory, Seoul, Korea, June 2009.

[28] Y. Wu and S. Verdu, “MMSE dimension,” IEEE Transactions on InformationTheory, no. 8, pp. 4857 – 4879, Aug. 2011.

[29] Y. Wu, S. Shamai (Shitz), and S. Verdu, “Degrees of Freedom of InterferenceChannel: a General Formula,” in Proceedings of 2011 IEEE International Sym-posium on Information Theory, Saint Petersburg, Russia, August 2011.

[30] Y. B. Pesin, Dimension Theory in Dynamical Systems: Contemporary Viewsand Applications. Chicago, IL: University of Chicago Press, 1997.

[31] A. N. Kolmogorov, “On the shannon theory of information transmission in thecase of continuous signals,” IEEE Transactions on Information Theory, vol. 2,no. 4, pp. 102–108, Dec. 1956.

[32] A. N. Kolmogorov, “Theory of transmission of information,” American Mathe-matical Society Translations: Series 2, no. 33, pp. 291–321, 1963.

[33] B. R. Hunt and V. Y. Kaloshin, “How projections affect the dimension spectrumof fractal measures,” Nonlinearity, vol. 10, pp. 1031–1046, 1997.

[34] A. M. Tulino and S. Verdu, “Random matrix theory and wireless communica-tions,” Foundations and Trends in Communications and Information Theory,vol. 1, no. 1, Jun. 2004.

[35] P. Mattila, Geometry of Sets and Measures in Euclidean Spaces: Fractals andRectifiability. Cambridge, United Kingdom: Cambridge University Press, 1999.

[36] I. Csiszar and P. C. Shields, “Information theory and statistics: A tutorial,”Foundations and Trends in Communications and Information Theory, vol. 1,no. 4, pp. 417–528, 2004.

[37] V. Strassen, “The existence of probability measures with given marginals,”Annals of Mathematical Statistics, vol. 36, no. 2, pp. 423–439, 1965.

[38] E. Çınlar, Probability and Stochastics. New York, NY: Springer, 2011.

[39] I. Csiszár, “On the dimension and entropy of order α of the mixture of probability distributions,” Acta Mathematica Hungarica, vol. 13, no. 3–4, pp. 245–255, Sep. 1962.

[40] R. S. Strichartz, “Self-similar measures and their Fourier transforms III,” Indiana University Mathematics Journal, vol. 42, 1993.

[41] C. Beck, “Upper and lower bounds on the Rényi dimensions and the uniformity of multifractals,” Physica D: Nonlinear Phenomena, vol. 41, no. 1, pp. 67–78, 1990.

[42] A. György, T. Linder, and K. Zeger, “On the rate-distortion function of random vectors and stationary sources with mixed distributions,” IEEE Transactions on Information Theory, vol. 45, pp. 2110–2115, Sep. 1999.

[43] T. Linder and R. Zamir, “On the asymptotic tightness of the Shannon lower bound,” IEEE Transactions on Information Theory, vol. 40, no. 6, pp. 2026–2031, Nov. 1994.

[44] P. Zador, “Asymptotic quantization error of continuous signals and the quantization dimension,” IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 139–149, 1982.

[45] S. Graf and H. Luschgy, Foundations of Quantization for Probability Distributions, ser. Lecture Notes in Mathematics. Berlin, Germany: Springer, 2000, vol. 1730.

[46] K. Falconer, Techniques in Fractal Geometry. Chichester, UK: John Wiley, 1997.

[47] Y. Peres and B. Solomyak, “Self-similar measures and intersections of Cantor sets,” Transactions of the American Mathematical Society, vol. 350, no. 10, pp. 4065–4087, 1998.

[48] D. Feng, N. Nguyen, and T. Wang, “Convolutions of equicontractive self-similar measures on the line,” Illinois Journal of Mathematics, vol. 46, no. 4, pp. 1339–1351, 2002.

[49] J. E. Hutchinson, “Fractals and self-similarity,” Indiana University Mathematics Journal, vol. 30, no. 5, pp. 713–747, 1981.

[50] Y. Peres, W. Schlag, and B. Solomyak, “Sixty years of Bernoulli convolutions,” in Fractal Geometry and Stochastics II, ser. Progress in Probability, C. Bandt, S. Graf, and M. Zähle, Eds. Springer, 2000, vol. 46, pp. 39–65.

[51] J. S. Geronimo and D. P. Hardin, “An exact formula for the measure dimensions associated with a class of piecewise linear maps,” Constructive Approximation, vol. 5, no. 1, pp. 89–98, 1989.

[52] N. Patzschke, “Self-conformal multifractal measures,” Advances in Applied Mathematics, vol. 19, no. 4, pp. 486–513, 1997.

[53] K. S. Lau and S. M. Ngai, “Multifractal measures and a weak separation condition,” Advances in Mathematics, vol. 141, no. 1, pp. 45–96, 1999.

[54] Y. Wu, S. Shamai (Shitz), and S. Verdú, “A general formula for degrees of freedom of the interference channel,” submitted to IEEE Transactions on Information Theory, 2011.

[55] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. Academic Press, Inc., 1982.

[56] I. Csiszár, “Generalized cutoff rates and Rényi’s information measures,” IEEE Transactions on Information Theory, vol. 41, no. 1, pp. 26–34, 1995.

[57] P. N. Chen and F. Alajaji, “Csiszár’s cutoff rates for arbitrary discrete sources,” IEEE Transactions on Information Theory, vol. 47, no. 1, pp. 330–338, January 2001.

[58] H. Shimokawa, “Rényi’s entropy and error exponent of source coding with countably infinite alphabet,” in Proceedings of 2006 IEEE International Symposium on Information Theory, Seattle, WA, Jul. 2006.

[59] C. Chang and A. Sahai, “Universal quadratic lower bounds on source coding error exponents,” in Conference on Information Sciences and Systems, 2007.

[60] R. G. Gallager, Information Theory and Reliable Communication. New York, NY: John Wiley & Sons, Inc., 1968.

[61] M. S. Pinsker, Information and Information Stability of Random Variables and Processes. San Francisco, CA: Holden-Day, 1964.

[62] K. T. Alligood, T. Sauer, and J. A. Yorke, Chaos: An Introduction to Dynamical Systems. Springer Verlag, 1996.

[63] Y. Wu and S. Verdú, “MMSE dimension,” in Proceedings of 2010 IEEE International Symposium on Information Theory, Austin, TX, June 2010.

[64] Y. Wu and S. Verdú, “Functional properties of MMSE and mutual information,” submitted to IEEE Transactions on Information Theory, 2010.

[65] D. Guo, S. Shamai (Shitz), and S. Verdú, “Mutual Information and Minimum Mean-Square Error in Gaussian Channels,” IEEE Transactions on Information Theory, vol. 51, no. 4, pp. 1261–1283, Apr. 2005.

[66] D. Guo, Y. Wu, S. Shamai (Shitz), and S. Verdú, “Estimation in Gaussian Noise: Properties of the Minimum Mean-square Error,” IEEE Transactions on Information Theory, vol. 57, no. 4, pp. 2371–2385, Apr. 2011.

[67] M. S. Pinsker, “Optimal filtering of square-integrable signals in Gaussian noise,” Problemy Peredachi Informatsii, vol. 16, no. 2, pp. 52–68, 1980.

[68] M. Nussbaum, “Minimax risk: Pinsker bound,” in Encyclopedia of Statistical Sciences. New York: Wiley, 1999, pp. 451–460.

[69] B. Z. Bobrovsky and M. Zakai, “Asymptotic a priori estimates for the error in the nonlinear filtering problem,” IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 371–376, 1982.

[70] J. K. Ghosh, Higher Order Asymptotics. Hayward, CA: Institute of Mathematical Statistics, 1994.

[71] J. K. Ghosh, B. K. Sinha, and S. N. Joshi, “Expansions for posterior probability and integrated Bayes risk,” in Statistical Decision Theory and Related Topics III. New York, NY: Academic Press, 1982, pp. 403–456.

[72] M. V. Burnasev, “Investigation of second order properties of statistical estimators in a scheme of independent observations,” Math. USSR Izvestiya, vol. 18, no. 3, pp. 439–467, 1982.

[73] P. J. Huber, Robust Statistics. New York, NY: Wiley-Interscience, 1981.

[74] L. D. Brown, “Admissible estimators, recurrent diffusions, and insoluble boundary value problems,” Annals of Mathematical Statistics, vol. 42, no. 3, pp. 855–903, 1971.

[75] V. V. Prelov and E. C. van der Meulen, “Asymptotics of Fisher Information under Weak Perturbation,” Problems of Information Transmission, vol. 31, no. 1, pp. 14–22, 1995.

[76] M. S. Pinsker, V. V. Prelov, and S. Verdú, “Sensitivity of channel capacity,” IEEE Transactions on Information Theory, vol. 41, no. 6, pp. 1877–1888, Nov. 1995.

[77] V. V. Prelov and S. Verdú, “Second-order asymptotics of mutual information,” IEEE Transactions on Information Theory, vol. 50, no. 8, pp. 1567–1580, Aug. 2004.

[78] A. J. Stam, “Some inequalities satisfied by the quantities of information of Fisher and Shannon,” Information and Control, vol. 2, no. 2, pp. 101–112, 1959.

[79] H. L. Van Trees, Detection, Estimation, and Modulation Theory, Part I. New York, NY: Wiley, 1968.

[80] L. D. Brown and L. Gajek, “Information inequalities for the Bayes risk,” Annals of Statistics, vol. 18, no. 4, pp. 1578–1594, 1990.

[81] E. L. Lehmann and G. Casella, Theory of Point Estimation, 2nd ed. New York, NY: Springer, 1998.

[82] C. Weidmann and M. Vetterli, “Rate distortion behavior of sparse sources,” Aug. 2010, in revision for IEEE Transactions on Information Theory.

[83] A. Lozano, A. M. Tulino, and S. Verdú, “Optimum power allocation for parallel Gaussian channels with arbitrary input distributions,” IEEE Transactions on Information Theory, vol. 52, no. 7, pp. 3033–3051, Jul. 2006.

[84] W. Rudin, Principles of Mathematical Analysis, 3rd ed. New York, NY: McGraw-Hill, 1976.

[85] K. Knopp and F. Bagemihl, Theory of Functions, Parts I and II. Mineola, NY: Dover Publications, 1996.

[86] Y. Wu and S. Verdú, “Functional properties of MMSE,” in Proceedings of 2010 IEEE International Symposium on Information Theory, Austin, TX, June 2010.

[87] R. Zamir, “A proof of the Fisher information inequality via a data processing argument,” IEEE Transactions on Information Theory, vol. 44, no. 3, pp. 1246–1250, Aug. 1998.

[88] S. Verdú, “Mismatched estimation and relative entropy,” IEEE Transactions on Information Theory, vol. 56, no. 8, pp. 3712–3720, Jul. 2010.

[89] T. Weissman, “The relationship between causal and noncausal mismatched estimation in continuous-time AWGN channels,” IEEE Transactions on Information Theory, vol. 56, no. 9, pp. 4256–4273, Aug. 2010.

[90] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, 1948.

[91] J. R. Munkres, Topology, 2nd ed. Upper Saddle River, NJ: Prentice-Hall, 2000.

[92] A. Montanari and E. Mossel, “Smooth compression, Gallager bound and nonlinear sparse graph codes,” in Proceedings of 2008 IEEE International Symposium on Information Theory, Toronto, Canada, Jul. 2008.

[93] E. Candès, J. Romberg, and T. Tao, “Stable signal recovery from incomplete and inaccurate measurements,” Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, Aug. 2006.

[94] R. Calderbank, S. Howard, and S. Jafarpour, “Construction of a large class of deterministic matrices that satisfy a statistical isometry property,” IEEE Journal of Selected Topics in Signal Processing, special issue on Compressed Sensing, vol. 29, no. 4, pp. 358–374, 2009.

[95] W. Xu and B. Hassibi, “Compressed sensing over the Grassmann manifold: A unified analytical framework,” in Proceedings of the Forty-Sixth Annual Allerton Conference on Communication, Control, and Computing, 2008, pp. 562–567.

[96] T. T. Cai, L. Wang, and G. Xu, “New bounds for restricted isometry constants,” IEEE Transactions on Information Theory, vol. 56, no. 9, pp. 4388–4394, 2010.

[97] W. Li, “The sharp Lipschitz constants for feasible and optimal solutions of a perturbed linear program,” Linear Algebra and its Applications, vol. 187, pp. 15–40, 1993.

[98] P. Halmos, Naive Set Theory. Princeton, NJ: D. Van Nostrand Company, 1960.

[99] K. Kuratowski, Topology, vol. I. New York, NY: Academic Press, 1966.

[100] J. T. Schwartz, Nonlinear Functional Analysis. New York, NY: Gordon and Breach Science Publishers, 1969.

[101] G. B. Folland, Real Analysis: Modern Techniques and Their Applications, 2nd ed. New York, NY: Wiley-Interscience, 1999.

[102] H. Steinhaus, “Sur les distances des points des ensembles de mesure positive,” Fundamenta Mathematicae, vol. 1, pp. 93–104, 1920.

[103] G. J. Minty, “On the extension of Lipschitz, Lipschitz-Hölder continuous, and monotone functions,” Bulletin of the American Mathematical Society, vol. 76, no. 2, pp. 334–339, 1970.

[104] H. Federer, Geometric Measure Theory. New York, NY: Springer-Verlag, 1969.

[105] D. Preiss, “Geometry of measures in R^n: Distribution, rectifiability, and densities,” Annals of Mathematics, vol. 125, no. 3, pp. 537–643, 1987.

[106] G. A. Edgar, Integral, Probability, and Fractal Measures. New York, NY: Springer, 1997.

[107] M. Keane and M. Smorodinsky, “Bernoulli schemes of the same entropy are finitarily isomorphic,” Annals of Mathematics, vol. 109, no. 2, pp. 397–406, 1979.

[108] A. Del Junco, “Finitary codes between one-sided Bernoulli shifts,” Ergodic Theory and Dynamical Systems, vol. 1, pp. 285–301, 1981.

[109] J. G. Sinaĭ, “A weak isomorphism of transformations with an invariant measure,” Doklady Akademii Nauk SSSR, vol. 147, pp. 797–800, 1962; English translation in American Mathematical Society Translations (2), vol. 57, pp. 123–143, 1966.

[110] K. E. Petersen, Ergodic Theory. Cambridge, United Kingdom: Cambridge University Press, 1990.

[111] W. H. Schikhof, Ultrametric Calculus: An Introduction to p-adic Analysis. New York, NY: Cambridge University Press, 2006.

[112] M. A. Martín and P. Mattila, “k-dimensional regularity classifications for s-fractals,” Transactions of the American Mathematical Society, vol. 305, no. 1, pp. 293–315, 1988.

[113] A. Edelman, “Eigenvalues and condition numbers of random matrices,” SIAM Journal on Matrix Analysis and Applications, vol. 9, no. 4, pp. 543–560, Oct. 1988.

[114] R. A. Brualdi, Introductory Combinatorics, 3rd ed. Upper Saddle River, NJ: Prentice Hall, 1999.

[115] Y. Wu, “MMSE dimension and noisy compressed sensing,” poster in Third Annual School of Information Theory, University of Southern California, Los Angeles, CA, Aug. 2010.

[116] Y. Wu and S. Verdú, “Optimal phase transitions in compressed sensing,” submitted to IEEE Transactions on Information Theory, Jul. 2011.

[117] T. Tanaka, “A statistical-mechanics approach to large-system analysis of CDMA multiuser detectors,” IEEE Transactions on Information Theory, vol. 48, pp. 2888–2910, Nov. 2002.

[118] D. Guo and S. Verdú, “Randomly spread CDMA: Asymptotics via statistical physics,” IEEE Transactions on Information Theory, vol. 51, no. 6, pp. 1983–2010, Jun. 2005.

[119] S. P. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, United Kingdom: Cambridge University Press, 2004.

[120] T. Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression. Englewood Cliffs, NJ: Prentice-Hall, 1971.

[121] T. Goblick Jr., “Theoretical limitations on the transmission of data from analog sources,” IEEE Transactions on Information Theory, vol. 11, no. 4, pp. 558–567, 1965.

[122] J. Ziv, “The behavior of analog communication systems,” IEEE Transactions on Information Theory, vol. 16, no. 5, pp. 587–594, 1970.

[123] A. Maleki, “Approximate message-passing algorithms for compressed sensing,” Ph.D. dissertation, Department of Electrical Engineering, Stanford University, Mar. 2011.

[124] J. D. Blanchard, C. Cartis, J. Tanner, and A. Thompson, “Phase transitions for greedy sparse approximation algorithms,” Applied and Computational Harmonic Analysis, pp. 188–203, 2010.

[125] M. Bayati and A. Montanari, “The LASSO risk for Gaussian matrices,” preprint, 2010. [Online]. Available: http://arxiv.org/abs/1008.2581

[126] M. Rudelson and R. Vershynin, “Smallest singular value of a random rectangular matrix,” Communications on Pure and Applied Mathematics, vol. 62, pp. 1707–1739, 2009.

[127] R. Gribonval and M. Nielsen, “Sparse representations in unions of bases,” IEEE Transactions on Information Theory, vol. 49, no. 12, pp. 3320–3325, 2003.

[128] S. Foucart and M. J. Lai, “Sparsest solutions of underdetermined linear systems via ℓq-minimization for 0 < q < 1,” Applied and Computational Harmonic Analysis, vol. 26, no. 3, pp. 395–407, 2009.

[129] M. E. Davies and R. Gribonval, “Restricted isometry constants where ℓp sparse recovery can fail for 0 < p ≤ 1,” IEEE Transactions on Information Theory, vol. 55, no. 5, pp. 2203–2214, 2009.

[130] J. D. Blanchard, C. Cartis, and J. Tanner, “Compressed sensing: How sharp is the restricted isometry property?” SIAM Review, vol. 53, no. 1, pp. 105–125, 2011.

[131] M. Wang, W. Xu, and A. Tang, “On the performance of sparse recovery via ℓp-minimization (0 ≤ p ≤ 1),” to appear in IEEE Transactions on Information Theory, 2011.

[132] Y. Zhang, “Theory of compressive sensing via ℓ1-minimization: A non-RIP analysis and extensions,” Department of Computational and Applied Mathematics, Rice University, Tech. Rep. TR08-11, 2008.

[133] D. L. Donoho and J. Tanner, “Sparse nonnegative solution of underdetermined linear equations by linear programming,” Proceedings of the National Academy of Sciences, vol. 102, no. 27, pp. 9446–9451, 2005.

[134] Y. Zhang, “A simple proof for recoverability of ℓ1-minimization (I): the non-negativity case,” Department of Computational and Applied Mathematics, Rice University, Tech. Rep. TR05-10, 2005.

[135] M. J. Wainwright, “Information-theoretic limitations on sparsity recovery in the high-dimensional and noisy setting,” IEEE Transactions on Information Theory, vol. 55, no. 12, pp. 5728–5741, Dec. 2009.

[136] A. K. Fletcher, S. Rangan, and V. K. Goyal, “Necessary and sufficient conditions for sparsity pattern recovery,” IEEE Transactions on Information Theory, vol. 55, no. 12, pp. 5758–5772, Dec. 2009.

[137] E. C. Posner and E. R. Rodemich, “Epsilon entropy and data compression,” Annals of Mathematical Statistics, vol. 42, no. 6, pp. 2079–2125, Dec. 1971.

[138] S. Ihara, “On the capacity of channels with additive non-Gaussian noise,” Information and Control, vol. 37, no. 1, pp. 34–39, 1978.

[139] T. Tao, “Sumset and inverse sumset theory for Shannon entropy,” Combinatorics, Probability & Computing, vol. 19, no. 4, pp. 603–639, 2010.

[140] I. Z. Ruzsa, “An application of graph theory to additive number theory,” Scientia Ser. A, vol. 3, pp. 97–109, 1989.

[141] V. A. Kaimanovich and A. M. Vershik, “Random walks on discrete groups: boundary and entropy,” Annals of Probability, vol. 11, no. 3, pp. 457–490, 1983.

[142] T. Tao, An Epsilon of Room, II: Pages from Year Three of a Mathematical Blog. Providence, RI: American Mathematical Society, 2011.

[143] M. Madiman, “On the entropy of sums,” in Proceedings of 2008 IEEE Information Theory Workshop, Porto, Portugal, 2008, pp. 303–307.

[144] T. Tao and V. Vu, “Entropy methods,” unpublished notes. [Online]. Available: http://www.math.ucla.edu/~tao/preprints/Expository/chapter_entropy.dvi

[145] I. Z. Ruzsa, “Entropy and sumsets,” Random Structures and Algorithms, vol. 34, pp. 1–10, Jan. 2009.

[146] M. Madiman and I. Kontoyiannis, “The entropies of the sum and the difference of two IID random variables are not too different,” in Proceedings of 2010 IEEE International Symposium on Information Theory, Austin, TX, June 2010, pp. 1369–1372.

[147] B. Bukh, “Sums of dilates,” Combinatorics, Probability and Computing, vol. 17, no. 5, pp. 627–639, 2008.

[148] J. W. S. Cassels, An Introduction to Diophantine Approximation. Cambridge, United Kingdom: Cambridge University Press, 1972.

[149] A. Y. Khinchin, Continued Fractions. Chicago, IL: The University of Chicago Press, 1964.

[150] L. C. Evans and R. F. Gariepy, Measure Theory and Fine Properties of Functions. Boca Raton, FL: CRC Press, 1992.

[151] J. Doob, “Relative limit theorems in analysis,” Journal d’Analyse Mathématique, vol. 8, no. 1, pp. 289–306, 1960.

[152] E. M. Stein and R. Shakarchi, Real Analysis: Measure Theory, Integration, and Hilbert Spaces. Princeton, NJ: Princeton University Press, 2005.

[153] V. Prokaj, “A Characterization of Singular Measures,” Real Analysis Exchange, vol. 29, pp. 805–812, 2004.

[154] G. Szegő, Orthogonal Polynomials, 4th ed. Providence, RI: American Mathematical Society, 1975.
