
A Gentle Introduction to Concentration Inequalities

Karthik Sridharan

Abstract

These notes are meant to be a review of some basic inequalities and bounds on random variables. A basic understanding of probability theory and set algebra is required of the reader. The aim of this document is to provide clear and complete proofs for some inequalities. For readers familiar with the topics, many of the steps might seem trivial. Nonetheless, they are provided to simplify the proofs for readers new to the topic. These notes also provide, to the best of my knowledge, the most general statement and proof of the symmetrization lemma. I also provide the less well-known but general proofs of Jensen's inequality and the logarithmic Sobolev inequality. Refer to [2] for a more detailed review of many of these inequalities, with examples demonstrating their uses.

1 Preliminary

Throughout these notes we shall consider a probability space $(\Omega, \mathcal{E}, P)$, where $\Omega$ is the sample space, $\mathcal{E}$ is the event class, which is a $\sigma$-algebra on $\Omega$, and $P$ is a probability measure. Further, we shall assume that there exists a Borel measurable function mapping every point $\omega \in \Omega$ uniquely to a real number, called a random variable. We shall call the space of random variables $X$ (note that $X \subseteq \mathbb{R}$). Further, we shall assume that all the functions and sets defined in these notes are measurable under the probability measure $P$.

2 Chebychev’s Inequality

Let us start these notes by proving what is referred to as Chebychev's inequality in [3]. Note that, often, by Chebychev's inequality one means an inequality derived from the theorem proved below. However, [3] refers to this inequality as Tchebycheff's inequality, and the same convention is followed in these notes.


Theorem 1 For some $a \in X$ where $X \subseteq \mathbb{R}$, let $f$ be a non-negative function such that $f(x) \geq b$ for all $x \geq a$, where $b \in Y$ with $Y \subseteq \mathbb{R}$. Then the following inequality holds:

$$P(x \geq a) \leq \frac{E\{f(x)\}}{b}$$

Proof Let the set $X_1 = \{x : x \geq a, \ x \in X\}$; therefore we have

$$X_1 \subseteq X$$

Since $f$ is a non-negative function, taking the Lebesgue integral of the function over the sets $X_1$ and $X$ we have

$$\int_X f \, dP \geq \int_{X_1} f \, dP \geq b \int_{X_1} dP$$

where $\int_{X_1} dP = P(X_1)$. However, the Lebesgue integral of a function with respect to a probability measure is its expectation. Hence we have

$$E\{f(x) : x \in X\} \geq b\, P(X_1)$$
$$\Rightarrow E\{f(x)\} \geq b\, P(x \geq a)$$

and hence

$$P(x \geq a) \leq \frac{E\{f(x)\}}{b} \qquad (1)$$

Now let us suppose that the function $f$ is monotonically increasing. Therefore, for every $x \geq a$, $f(x) \geq f(a)$. In (1) use $b = f(a)$. Therefore we get

$$P(x \geq a) \leq \frac{E\{f(x)\}}{f(a)}$$

From this we can get well-known inequalities like

$$P(x \geq a) \leq \frac{E\{x\}}{a}$$

called Markov's inequality, which holds for $a > 0$ and non-negative $x$;

$$P(|x - E(x)| \geq a) \leq \frac{E\{|x - E(x)|^2\}}{a^2} = \frac{Var\{x\}}{a^2} \qquad (2)$$

often called Chebychev's inequality (in these notes, $Var\{\cdot\}$ and $\sigma^2$ are used interchangeably for the variance); and Chernoff's bound

$$P(x \geq a) \leq \frac{E e^{sx}}{e^{sa}} \qquad (3)$$
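As a quick numerical illustration, the following minimal Python sketch (assuming, purely for illustration, an $\mathrm{Exp}(1)$ random variable, the threshold $a = 5$, and the closed-form MGF $E e^{sx} = 1/(1-s)$ of the exponential distribution) compares the empirical tail probability with the Markov, Chebychev, and Chernoff bounds above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # Exp(1): E[x] = 1, Var{x} = 1
a = 5.0

empirical = np.mean(x >= a)
markov = 1.0 / a                                  # E[x]/a, using E[x] = 1
chebychev = 1.0 / (a - 1.0) ** 2                  # Var{x}/(a - E[x])^2, tail of |x - E[x]|
s = 1.0 - 1.0 / a                                 # minimiser of the Chernoff bound for Exp(1)
chernoff = (1.0 / (1.0 - s)) * np.exp(-s * a)     # E[e^{sx}]/e^{sa}, using the MGF 1/(1-s)

print(f"empirical P(x >= {a})    = {empirical:.5f}")
print(f"Markov bound             = {markov:.5f}")
print(f"Chebychev bound          = {chebychev:.5f}")
print(f"Chernoff bound (s={s:.2f}) = {chernoff:.5f}")
```

All three values upper-bound the empirical tail probability; which bound is tightest depends on the distribution and the threshold.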


3 Information Theoretic Bounds

3.1 Jensen’s Inequality

Here we shall state and prove a generalized, measure-theoretic form of Jensen's inequality. In probability theory, a more specific form of Jensen's inequality is well known. But before that we shall first define a convex function.

Definition A function $\phi(x)$ is defined to be convex in an interval $(a, b)$ if for every point $x_0$ in the interval $(a, b)$ there exists an $m$ such that

$$\phi(x) \geq m(x - x_0) + \phi(x_0) \qquad (4)$$

for any $x \in (a, b)$.

Note that this definition can be proved to be equivalent to the definition of a convex function as one whose value at any point in the interval of convexity lies below the line segment joining the endpoints of any subinterval of the convex interval containing that point. We chose this particular definition to simplify the proof of Jensen's inequality. Now, without further ado, let us move to stating and proving Jensen's inequality. (Note: refer to [4] for a similar generalized proof of Jensen's inequality.)

Theorem 2 Let $f$ and $\mu$ be measurable functions of $x$ which are finite a.e. on $A \subseteq \mathbb{R}^n$. Let $f\mu$ and $\mu$ be integrable on $A$ and $\mu \geq 0$. If $\phi$ is a function which is convex in the interval $(a, b)$, which is the range of the function $f$, and $\int_A \phi(f)\mu$ exists, then

$$\phi\!\left(\frac{\int_A f\mu}{\int_A \mu}\right) \leq \frac{\int_A \phi(f)\mu}{\int_A \mu} \qquad (5)$$

Proof From our assumptions, the range of $f$ is $(a, b)$, which is the interval in which $\phi(x)$ is convex. Hence, consider the number

$$x_0 = \frac{\int_A f\mu}{\int_A \mu}$$


Clearly it lies within the interval $(a, b)$. Further, from Equation (4) we have, for almost every $x$,

$$\phi(f(x)) \geq m(f(x) - x_0) + \phi(x_0)$$

Multiplying by $\mu$ and integrating both sides we get

$$\int_A \phi(f)\mu \geq m\!\left(\int_A f\mu - x_0 \int_A \mu\right) + \phi(x_0)\int_A \mu$$

Now see that $\int_A f\mu - x_0 \int_A \mu = 0$, and hence we have

$$\int_A \phi(f)\mu \geq \phi(x_0)\int_A \mu$$

Hence we get the result

$$\int_A \phi(f)\mu \geq \phi\!\left(\frac{\int_A f\mu}{\int_A \mu}\right)\int_A \mu$$

Now note that if $\mu$ is a probability measure then $\int_A \mu = 1$, and since the expected value is simply the Lebesgue integral of a function w.r.t. the probability measure, we have

$$E[\phi(f(x))] \geq \phi(E[f(x)]) \qquad (6)$$

for any function $\phi$ convex on the range of the function $f(x)$.

Now, a function $\phi(x)$ is convex if $\phi'(x)$ exists and is monotonically increasing, or if the second derivative exists and is non-negative. Therefore we can conclude that the function $-\log x$ is a convex function. Therefore, by Jensen's inequality, we have

$$E[-\log f(x)] \geq -\log E[f(x)]$$

Now if we take the function $f(x)$ to be a probability, we get the result that the entropy $H(P)$ is always greater than or equal to $0$. If we make $f(x)$ the ratio of two probability measures $dQ$ and $dP$, we get the result that the relative entropy, or KL divergence, of two distributions is always non-negative. That is,

$$D(P||Q) = E_P\!\left[-\log\frac{dQ(x)}{dP(x)}\right] \geq -\log\left(E_P\!\left[\frac{dQ(x)}{dP(x)}\right]\right) = -\log\left(\int dQ(x)\right) = -\log 1 = 0$$

Therefore,

$$D(P||Q) \geq 0$$
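As a small numerical illustration of (6) and of the non-negativity of the KL divergence, the sketch below evaluates both sides of Jensen's inequality for $\phi(x) = -\log x$ with $f = dQ/dP$; the two randomly generated six-point distributions are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two arbitrary discrete distributions on the same finite support.
p = rng.random(6); p /= p.sum()
q = rng.random(6); q /= q.sum()

# Jensen's inequality (6) with the convex function phi(x) = -log x,
# applied to f = q/p under the distribution p.
lhs = np.sum(p * -np.log(q / p))      # E_P[-log(dQ/dP)] = D(P||Q)
rhs = -np.log(np.sum(p * (q / p)))    # -log E_P[dQ/dP] = -log 1 = 0

print(f"D(P||Q)          = {lhs:.6f}")
print(f"-log E_P[dQ/dP]  = {rhs:.6f}")
assert lhs >= rhs - 1e-12             # E[phi(f)] >= phi(E[f]), hence D(P||Q) >= 0
```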


3.2 Han’s Inequality

We shall first prove Han's inequality for entropy, and then, using that result, we shall prove Han's inequality for relative entropy.

Theorem 3 Let $x_1, x_2, \ldots, x_n$ be discrete random variables from a sample space $X$. Then

$$H(x_1, \ldots, x_n) \leq \frac{1}{n-1}\sum_{i=1}^{n} H(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) \qquad (7)$$

Proof Note that $D(P_{X,Y} || P_X \times P_Y) = H(X) - H(X|Y)$. Since we have already proved that relative entropy is non-negative, we have $H(X) \geq H(X|Y)$. In a loose sense this means that information (about some variable $Y$) can only reduce the entropy, or uncertainty, $H(X|Y)$, which makes intuitive sense. Now consider the entropy

$$H(x_1, \ldots, x_n) = H(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) + H(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$$

Since we have already seen that $H(X) \geq H(X|Y)$, applying this (conditioning on fewer variables can only increase the conditional entropy) we have

$$H(x_1, \ldots, x_n) \leq H(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) + H(x_i \mid x_1, \ldots, x_{i-1})$$

Summing both sides over $i = 1, \ldots, n$ we get

$$n H(x_1, \ldots, x_n) \leq \sum_{i=1}^{n}\left[ H(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) + H(x_i \mid x_1, \ldots, x_{i-1}) \right]$$

Now note that, by the definition of conditional entropy, $H(X|Y) = H(X, Y) - H(Y)$. Extending this to many variables we get the chain rule of entropy,

$$H(x_1, \ldots, x_n) = \sum_{i=1}^{n} H(x_i \mid x_1, \ldots, x_{i-1})$$

Therefore, using this chain rule,

$$n H(x_1, \ldots, x_n) \leq \sum_{i=1}^{n} H(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) + H(x_1, \ldots, x_n)$$

Therefore,

$$H(x_1, \ldots, x_n) \leq \frac{1}{n-1}\sum_{i=1}^{n} H(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$$
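A small enumeration can be used to check inequality (7) directly; the sketch below assumes, purely for illustration, an arbitrary joint distribution of three binary variables.

```python
import numpy as np

rng = np.random.default_rng(2)

n, k = 3, 2                                          # three binary variables
joint = rng.random((k,) * n); joint /= joint.sum()   # an arbitrary joint distribution

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

H_full = entropy(joint.ravel())
# Entropies of the marginals with one coordinate summed out, i.e. H(x^(i)).
H_leave_one_out = [entropy(joint.sum(axis=i).ravel()) for i in range(n)]

lhs = H_full
rhs = sum(H_leave_one_out) / (n - 1)
print(f"H(x_1,...,x_n) = {lhs:.6f}  <=  {rhs:.6f} = (1/(n-1)) * sum_i H(x^(i))")
assert lhs <= rhs + 1e-12
```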


Now we shall prove Han's inequality for relative entropies. Let $x_1^n = (x_1, x_2, \ldots, x_n)$ be discrete random variables from a sample space $X$, just like in the previous case. Now let $P$ and $Q$ be probability distributions on the product space $X^n$, and let $P$ be a distribution such that

$$\frac{dP(x_1^n)}{dx_1^n} = \frac{dP_1(x_1)}{dx_1}\frac{dP_2(x_2)}{dx_2}\cdots\frac{dP_n(x_n)}{dx_n}$$

That is, the distribution $P$ assumes independence of the variables $(x_1, \ldots, x_n)$, with the probability density function of each variable $x_i$ being $\frac{dP_i(x_i)}{dx_i}$. Let $x^{(i)} = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$. Now, with this setting, we shall state and prove Han's relative entropy inequality.

Theorem 4 Given any distribution $Q$ on the product space $X^n$ and a distribution $P$ on $X^n$ which assumes independence of the variables $(x_1, \ldots, x_n)$,

$$D(Q||P) \geq \frac{1}{n-1}\sum_{i=1}^{n} D(Q^{(i)} || P^{(i)}) \qquad (8)$$

where

$$Q^{(i)}(x^{(i)}) = \int_X \frac{dQ(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n)}{dx_1^n}\, dx_i$$

and

$$P^{(i)}(x^{(i)}) = \int_X \frac{dP(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n)}{dx_1^n}\, dx_i$$

Proof By the definition of relative entropy,

$$D(Q||P) = \int_{X^n} dQ\,\log\frac{dQ}{dP} = \int_{X^n}\left[\frac{dQ}{dx_1^n}\log\!\left(\frac{dQ}{dx_1^n}\right) - \frac{dQ}{dx_1^n}\log\!\left(\frac{dP}{dx_1^n}\right)\right] dx_1^n \qquad (9)$$

In the above equation, consider the term $\int_{X^n}\frac{dQ}{dx_1^n}\log\!\left(\frac{dP}{dx_1^n}\right)dx_1^n$. From our assumption about $P$ we know that

$$\frac{dP(x_1^n)}{dx_1^n} = \frac{dP_1(x_1)}{dx_1}\frac{dP_2(x_2)}{dx_2}\cdots\frac{dP_n(x_n)}{dx_n}$$

Now $P^{(i)}(x^{(i)}) = \int_X \frac{dP(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n)}{dx_1^n}\, dx_i$; therefore,

$$\frac{dP(x_1^n)}{dx_1^n} = \frac{dP_i(x_i)}{dx_i}\,\frac{dP^{(i)}(x^{(i)})}{dx^{(i)}}$$

Therefore, using this, we get

$$\int_{X^n}\frac{dQ}{dx_1^n}\log\!\left(\frac{dP(x_1^n)}{dx_1^n}\right)dx_1^n = \frac{1}{n}\sum_{i=1}^{n}\int_{X^n}\frac{dQ}{dx_1^n}\left(\log\!\left(\frac{dP_i(x_i)}{dx_i}\right) + \log\!\left(\frac{dP^{(i)}(x^{(i)})}{dx^{(i)}}\right)\right)dx_1^n$$


$$= \frac{1}{n}\sum_{i=1}^{n}\int_{X^{n-1}}\frac{dQ^{(i)}}{dx^{(i)}}\log\!\left(\frac{dP^{(i)}(x^{(i)})}{dx^{(i)}}\right)dx^{(i)} + \frac{1}{n}\int_{X^n}\frac{dQ}{dx_1^n}\log\!\left(\frac{dP(x_1^n)}{dx_1^n}\right)dx_1^n$$

(using $\sum_i \log\frac{dP_i(x_i)}{dx_i} = \log\frac{dP(x_1^n)}{dx_1^n}$ and integrating out $x_i$ in each term of the second sum). Rearranging the terms we get

$$\int_{X^n}\frac{dQ}{dx_1^n}\log\!\left(\frac{dP(x_1^n)}{dx_1^n}\right)dx_1^n = \frac{1}{n-1}\sum_{i=1}^{n}\int_{X^{n-1}}\frac{dQ^{(i)}}{dx^{(i)}}\log\!\left(\frac{dP^{(i)}(x^{(i)})}{dx^{(i)}}\right)dx^{(i)}$$

Now also note that, by Han's inequality for entropy,

$$\int_{X^n}\frac{dQ}{dx_1^n}\log\!\left(\frac{dQ(x_1^n)}{dx_1^n}\right)dx_1^n \geq \frac{1}{n-1}\sum_{i=1}^{n}\int_{X^{n-1}}\frac{dQ^{(i)}}{dx^{(i)}}\log\!\left(\frac{dQ^{(i)}(x^{(i)})}{dx^{(i)}}\right)dx^{(i)}$$

Therefore, when we consider the relative entropy given by Equation (9), we get

$$D(Q||P) \geq \frac{1}{n-1}\int_{X^{n-1}}\sum_{i=1}^{n}\frac{dQ^{(i)}}{dx^{(i)}}\log\!\left(\frac{dQ^{(i)}(x^{(i)})}{dx^{(i)}}\right)dx^{(i)} - \frac{1}{n-1}\int_{X^{n-1}}\sum_{i=1}^{n}\frac{dQ^{(i)}}{dx^{(i)}}\log\!\left(\frac{dP^{(i)}(x^{(i)})}{dx^{(i)}}\right)dx^{(i)}$$

Thus, finally simplifying, we get the required result,

$$D(Q||P) \geq \frac{1}{n-1}\sum_{i=1}^{n} D(Q^{(i)} || P^{(i)})$$
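Similarly, inequality (8) can be checked numerically; the sketch below assumes, for illustration, three binary variables, a randomly generated product distribution $P$, and an arbitrary $Q$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 3, 2

# P: a product distribution on {0,1}^3; Q: an arbitrary distribution on the same space.
marginals = [rng.random(k) for _ in range(n)]
marginals = [m / m.sum() for m in marginals]
P = marginals[0][:, None, None] * marginals[1][None, :, None] * marginals[2][None, None, :]
Q = rng.random((k,) * n); Q /= Q.sum()

def kl(q, p):
    q, p = q.ravel(), p.ravel()
    return np.sum(q * np.log(q / p))

lhs = kl(Q, P)
# D(Q^(i) || P^(i)) for the marginals with coordinate i integrated out.
rhs = sum(kl(Q.sum(axis=i), P.sum(axis=i)) for i in range(n)) / (n - 1)
print(f"D(Q||P) = {lhs:.6f}  >=  {rhs:.6f} = (1/(n-1)) * sum_i D(Q^(i)||P^(i))")
assert lhs >= rhs - 1e-12
```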

4 Inequalities of Sums of Random Variables

4.1 Hoeffding's Inequality

Theorem 5 Let $x_1, x_2, \ldots, x_n$ be independent bounded random variables such that the random variable $x_i$ falls in the interval $[p_i, q_i]$. Then for any $a > 0$ we have

$$P\!\left(\sum_{i=1}^{n} x_i - E\!\left(\sum_{i=1}^{n} x_i\right) \geq a\right) \leq e^{-\frac{2a^2}{\sum_{i=1}^{n}(q_i - p_i)^2}}$$

Proof From Chernoff's bound (3) we get

$$P(x - E(x) \geq a) \leq \frac{E e^{s(x - E(x))}}{e^{sa}}$$

Let $S_n = \sum_{i=1}^{n} x_i$. Therefore we have

$$P(S_n - E(S_n) \geq a) \leq \frac{E e^{s(S_n - E(S_n))}}{e^{sa}} = e^{-sa}\prod_{i=1}^{n} E e^{s(x_i - E(x_i))} \qquad (10)$$

where the last equality follows from the independence of the $x_i$.

Now let $y$ be any random variable such that $p \leq y \leq q$ and $Ey = 0$. Then for any $s > 0$, due to the convexity of the exponential function, we have

$$e^{sy} \leq \frac{y - p}{q - p}e^{sq} + \frac{q - y}{q - p}e^{sp}$$

Taking expectations on both sides we get

$$E e^{sy} \leq \frac{-p}{q - p}e^{sq} + \frac{q}{q - p}e^{sp}$$

Now let $\alpha = \frac{-p}{q - p}$. Therefore,

$$E e^{sy} \leq \left(\alpha e^{s(q-p)} + (1 - \alpha)\right)e^{-s\alpha(q-p)}$$
$$\Rightarrow E e^{sy} \leq e^{\log\left(\alpha e^{s(q-p)} + (1 - \alpha)\right) - s\alpha(q-p)}$$
$$\Rightarrow E e^{sy} \leq e^{\phi(u)} \qquad (11)$$

where the function $\phi(u) = \log(\alpha e^u + (1 - \alpha)) - u\alpha$ and $u = s(q - p)$.

Now, using Taylor's theorem, we have for some $\eta$ between $0$ and $u$,

$$\phi(u) = \phi(0) + u\phi'(0) + \frac{u^2}{2}\phi''(\eta) \qquad (12)$$

But we have $\phi(0) = 0$ and $\phi'(x) = \frac{\alpha e^x}{\alpha e^x + (1 - \alpha)} - \alpha$, therefore $\phi'(0) = 0$. Now,

$$\phi''(x) = \frac{\alpha(1 - \alpha)e^x}{(1 - \alpha + \alpha e^x)^2}$$

If we consider $\phi''(x)$, we see that the function is maximized when

$$\phi'''(x) = \frac{\alpha(1 - \alpha)e^x}{(1 - \alpha + \alpha e^x)^2} - \frac{2\alpha^2(1 - \alpha)e^{2x}}{(1 - \alpha + \alpha e^x)^3} = 0$$
$$\Rightarrow e^x = \frac{1 - \alpha}{\alpha}$$


Therefore, for any $x$,

$$\phi''(x) \leq \frac{(1 - \alpha)^2}{(2 - 2\alpha)^2} = \frac{1}{4}$$

Therefore, from (12) and (11), we have

$$E e^{sy} \leq e^{\frac{u^2}{8}}$$

Therefore, for any $p \leq y \leq q$,

$$E e^{sy} \leq e^{\frac{s^2(q-p)^2}{8}} \qquad (13)$$

Using this in (10) we get

$$P(S_n - E(S_n) \geq a) \leq e^{-sa}\, e^{\frac{s^2\sum_{i=1}^{n}(q_i - p_i)^2}{8}}$$

Now we find the best bound by minimizing the R.H.S. of the above equation w.r.t. $s$. Therefore we have

$$\frac{d}{ds}\, e^{\frac{s^2\sum_{i=1}^{n}(q_i - p_i)^2}{8} - sa} = e^{\frac{s^2\sum_{i=1}^{n}(q_i - p_i)^2}{8} - sa}\left(\frac{2s\sum_{i=1}^{n}(q_i - p_i)^2}{8} - a\right) = 0$$

Therefore, for the best bound, we have

$$s = \frac{4a}{\sum_{i=1}^{n}(q_i - p_i)^2}$$

and correspondingly we get

$$P(S_n - E(S_n) \geq a) \leq e^{-\frac{2a^2}{\sum_{i=1}^{n}(q_i - p_i)^2}} \qquad (14)$$

An interesting consequence of Hoeffding's inequality, often used in learning theory, is to bound not the difference between a sum and its expectation but the difference between the empirical average of a loss function and its expectation. This is done by using (14) as

$$P(S_n - E(S_n) \geq na) \leq e^{-\frac{2(an)^2}{\sum_{i=1}^{n}(q_i - p_i)^2}}$$
$$\Rightarrow P\!\left(\frac{S_n}{n} - \frac{E(S_n)}{n} \geq a\right) \leq e^{-\frac{2(an)^2}{\sum_{i=1}^{n}(q_i - p_i)^2}}$$

Now $E_n x = \frac{S_n}{n}$ denotes the empirical average of $x$, and $\frac{E(S_n)}{n} = Ex$. Therefore we have

$$P(E_n(x) - E(x) \geq a) \leq e^{-\frac{2n^2 a^2}{\sum_{i=1}^{n}(q_i - p_i)^2}} \qquad (15)$$
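As a rough Monte Carlo illustration of the bound (15), the sketch below assumes i.i.d. Bernoulli($0.3$) variables (so $p_i = 0$, $q_i = 1$), $n = 200$ and $a = 0.1$; these numerical choices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
n, a, trials = 200, 0.1, 100_000
p = 0.3                                     # Bernoulli(p) variables, so p_i = 0, q_i = 1

x = rng.binomial(1, p, size=(trials, n))
emp_mean = x.mean(axis=1)
empirical = np.mean(emp_mean - p >= a)      # P(E_n(x) - E(x) >= a)
hoeffding = np.exp(-2 * n**2 * a**2 / (n * 1.0**2))   # bound (15) with (q_i - p_i) = 1

print(f"empirical P(E_n - E >= {a}) = {empirical:.5f}")
print(f"Hoeffding bound (15)        = {hoeffding:.5f}")
```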


4.2 Bernstein’s Inequality

Hoeffding's inequality does not use any knowledge about the distribution of the variables. Bernstein's inequality [7] uses the variance of the distribution to get a tighter bound.

Theorem 6 Let $x_1, x_2, \ldots, x_n$ be independent bounded random variables such that $E x_i = 0$ and $|x_i| \leq \zeta$ with probability $1$, and let $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} Var\{x_i\}$. Then for any $\epsilon > 0$ we have

$$P\!\left(\frac{1}{n}\sum_{i=1}^{n} x_i \geq \epsilon\right) \leq e^{-\frac{n\epsilon^2}{2\sigma^2 + 2\zeta\epsilon/3}}$$

Proof We need to re-estimate the bound starting from (10). Let

$$F_i = \sum_{r=2}^{\infty}\frac{s^{r-2}E(x_i^r)}{r!\,\sigma_i^2}$$

where $\sigma_i^2 = E x_i^2$. Now $e^x = 1 + x + \sum_{r=2}^{\infty}\frac{x^r}{r!}$. Therefore,

$$E e^{sx_i} = 1 + sE x_i + \sum_{r=2}^{\infty}\frac{s^r E(x_i^r)}{r!}$$

Since $E x_i = 0$ we have

$$E e^{sx_i} = 1 + F_i s^2\sigma_i^2 \leq e^{F_i s^2\sigma_i^2}$$

Consider the term $E x_i^r$. Since the expectation of a function is just the Lebesgue integral of the function with respect to the probability measure, we have $E x_i^r = \int x_i^{r-1} x_i\, dP$. Using Schwarz's inequality we get

$$E x_i^r = \int x_i^{r-1} x_i\, dP \leq \left(\int |x_i^{r-1}|^2\, dP\right)^{\frac{1}{2}}\left(\int |x_i|^2\, dP\right)^{\frac{1}{2}}$$
$$\Rightarrow E x_i^r \leq \sigma_i\left(\int |x_i^{r-1}|^2\, dP\right)^{\frac{1}{2}}$$

Proceeding to use Schwarz's inequality recursively $n$ times we get

$$E x_i^r \leq \sigma_i^{1 + \frac{1}{2} + \frac{1}{2^2} + \cdots + \frac{1}{2^{n-1}}}\left(\int |x_i^{(2^n r - 2^{n+1} - 1)}|\, dP\right)^{\frac{1}{2^n}} = \sigma_i^{2(1 - \frac{1}{2^n})}\left(\int |x_i^{(2^n r - 2^{n+1} - 1)}|\, dP\right)^{\frac{1}{2^n}}$$

Now we know that $|x_i| \leq \zeta$. Therefore

$$\left(\int |x_i^{(2^n r - 2^{n+1} - 1)}|\, dP\right)^{\frac{1}{2^n}} \leq \left(\zeta^{(2^n r - 2^{n+1} - 1)}\right)^{\frac{1}{2^n}}$$

Hence we get

$$E x_i^r \leq \sigma_i^{2(1 - \frac{1}{2^n})}\,\zeta^{(r - 2 - \frac{1}{2^n})}$$

Taking the limit as $n \to \infty$ we get

$$E x_i^r \leq \lim_{n\to\infty}\sigma_i^{2(1 - \frac{1}{2^n})}\,\zeta^{(r - 2 - \frac{1}{2^n})}$$
$$\Rightarrow E x_i^r \leq \sigma_i^2\,\zeta^{r-2} \qquad (16)$$

Therefore,

$$F_i = \sum_{r=2}^{\infty}\frac{s^{r-2}E(x_i^r)}{r!\,\sigma_i^2} \leq \sum_{r=2}^{\infty}\frac{s^{r-2}\sigma_i^2\zeta^{r-2}}{r!\,\sigma_i^2}$$

Therefore,

$$F_i \leq \frac{1}{s^2\zeta^2}\sum_{r=2}^{\infty}\frac{s^r\zeta^r}{r!} = \frac{1}{s^2\zeta^2}\left(e^{s\zeta} - 1 - s\zeta\right)$$

Substituting this bound on $F_i$ into $E e^{sx_i} \leq e^{F_i s^2\sigma_i^2}$ we get

$$E e^{sx_i} \leq e^{s^2\sigma_i^2\frac{(e^{s\zeta} - 1 - s\zeta)}{s^2\zeta^2}}$$

Now, using (10) and the fact that $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\sigma_i^2$, we get

$$P(S_n \geq a) \leq e^{-sa}\, e^{s^2 n\sigma^2\frac{(e^{s\zeta} - 1 - s\zeta)}{s^2\zeta^2}} \qquad (17)$$

Now, to obtain the tightest bound, we minimize the R.H.S. w.r.t. $s$. Therefore we get

$$\frac{d}{ds}\, e^{s^2 n\sigma^2\frac{(e^{s\zeta} - 1 - s\zeta)}{s^2\zeta^2} - sa} = e^{s^2 n\sigma^2\frac{(e^{s\zeta} - 1 - s\zeta)}{s^2\zeta^2} - sa}\left(n\sigma^2\frac{\zeta e^{s\zeta} - \zeta}{\zeta^2} - a\right) = 0$$

Therefore, for the tightest bound, we have

$$\frac{e^{s\zeta} - 1}{\zeta} = \frac{a}{n\sigma^2}$$

Therefore we have

$$s = \frac{1}{\zeta}\log\!\left(\frac{a\zeta}{n\sigma^2} + 1\right)$$

Using this $s$ in (17) we get

$$P(S_n \geq a) \leq e^{\frac{n\sigma^2}{\zeta^2}\left(\frac{a\zeta}{n\sigma^2} - \log\left(\frac{a\zeta}{n\sigma^2} + 1\right)\right) - \frac{a}{\zeta}\log\left(\frac{a\zeta}{n\sigma^2} + 1\right)} = e^{\frac{n\sigma^2}{\zeta^2}\left(\frac{a\zeta}{n\sigma^2} - \log\left(\frac{a\zeta}{n\sigma^2} + 1\right) - \frac{a\zeta}{n\sigma^2}\log\left(\frac{a\zeta}{n\sigma^2} + 1\right)\right)}$$

Let $H(x) = (1 + x)\log(1 + x) - x$. Therefore we get

$$P(S_n \geq a) \leq e^{-\frac{n\sigma^2}{\zeta^2}H\left(\frac{a\zeta}{n\sigma^2}\right)} \qquad (18)$$

This is called Bennett's inequality [6]. We can derive Bernstein's inequality by further bounding the function $H(x)$. Let the function $G(x) = \frac{3}{2}\frac{x^2}{x+3}$. We see that $H(0) = G(0) = H'(0) = G'(0) = 0$, and that $H''(x) = \frac{1}{x+1}$ and $G''(x) = \frac{27}{(x+3)^3}$. Therefore $H''(0) \geq G''(0)$ and, further, if $f^{(n)}(x)$ denotes the $n$th derivative of a function $f$, then $H^{(n)}(0) \geq G^{(n)}(0)$ for any $n \geq 2$. Therefore, as a consequence of Taylor's theorem, we have

$$H(x) \geq G(x) \quad \forall x \geq 0$$

Therefore, applying this to (18), we get

$$P\!\left(\sum_{i=1}^{n} x_i \geq a\right) \leq e^{-\frac{n\sigma^2}{\zeta^2}G\left(\frac{a\zeta}{n\sigma^2}\right)}$$
$$\Rightarrow P\!\left(\sum_{i=1}^{n} x_i \geq a\right) \leq e^{-\frac{3a^2}{2(a\zeta + 3n\sigma^2)}}$$

Now let $a = n\epsilon$. Therefore,

$$P\!\left(\sum_{i=1}^{n} x_i \geq n\epsilon\right) \leq e^{-\frac{3n^2\epsilon^2}{2(n\epsilon\zeta + 3n\sigma^2)}}$$

Therefore we get

$$P\!\left(\frac{1}{n}\sum_{i=1}^{n} x_i \geq \epsilon\right) \leq e^{-\frac{n\epsilon^2}{2\sigma^2 + 2\zeta\epsilon/3}} \qquad (19)$$

An interesting phenomenon here is that if $\sigma < \epsilon$, then the upper bound decays roughly as $e^{-n\epsilon}$ rather than as $e^{-n\epsilon^2}$, as suggested by Hoeffding's inequality (14).
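To see the effect of the variance term, the following sketch compares the value of the Hoeffding bound with that of the Bernstein bound (19) for centred variables with $|x_i| \leq \zeta = 1$ but small variance; the numerical values of $n$, $\sigma^2$ and $\epsilon$ are illustrative assumptions.

```python
import numpy as np

# Compare the Hoeffding tail bound with the Bernstein bound (19) for centred
# variables bounded by zeta = 1 but with small variance sigma^2.
n = 1000
sigma2 = 0.01        # sigma^2 = (1/n) * sum_i Var{x_i}
zeta = 1.0           # |x_i| <= zeta, so each x_i lives in an interval of length 2*zeta
eps = 0.05

hoeffding = np.exp(-2 * n * eps**2 / (2 * zeta) ** 2)                # (15) with q_i - p_i = 2*zeta
bernstein = np.exp(-n * eps**2 / (2 * sigma2 + 2 * zeta * eps / 3))  # (19)

print(f"Hoeffding bound = {hoeffding:.3e}")
print(f"Bernstein bound = {bernstein:.3e}")   # much smaller when sigma^2 << eps
```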


5 Inequalities of Functions of Random Variables

5.1 Efron-Stein's Inequality

Till now we have only considered sums of R.V.s. Now we shall consider functions of R.V.s. The so-called Efron-Stein inequality ([9]; see also Steele [13]), given below, is one of the tightest bounds known.

Theorem 7 Let $S : X^n \to \mathbb{R}$ be a measurable function which is invariant under permutations, and let the random variable $Z$ be given by $Z = S(x_1, x_2, \ldots, x_n)$. Then we have

$$Var(Z) \leq \frac{1}{2}\sum_{i=1}^{n} E[(Z - Z'_i)^2]$$

where $Z'_i = S(x_1, \ldots, x'_i, \ldots, x_n)$ and $\{x'_1, \ldots, x'_n\}$ is another sample from the same distribution as that of $\{x_1, \ldots, x_n\}$.

Proof Let

$$E_i Z = E[Z \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n]$$

and let $V = Z - EZ$. Now, if we define $V_i$ as

$$V_i = E[Z \mid x_1, \ldots, x_i] - E[Z \mid x_1, \ldots, x_{i-1}], \quad \forall i = 1, \ldots, n$$

then $V = \sum_{i=1}^{n} V_i$ and

$$Var(Z) = EV^2 = E\!\left[\left(\sum_{i=1}^{n} V_i\right)^2\right] = E\!\left[\sum_{i=1}^{n} V_i^2\right] + 2E\!\left[\sum_{i > j} V_i V_j\right]$$

Now $E[XY] = E[E[XY|Y]] = E[Y E[X|Y]]$; therefore, for $i > j$, $E[V_i V_j] = E[V_j\, E[V_i \mid x_1, \ldots, x_{i-1}]]$. But $E[V_i \mid x_1, \ldots, x_{i-1}] = 0$. Therefore we have

$$Var(Z) = E\!\left[\sum_{i=1}^{n} V_i^2\right] = \sum_{i=1}^{n} E[V_i^2]$$

Now let $E_{x_i^j}$ represent expectation w.r.t. the variables $\{x_i, \ldots, x_j\}$.

$$Var(Z) = \sum_{i=1}^{n} E\!\left[\left(E[Z \mid x_1, \ldots, x_i] - E[Z \mid x_1, \ldots, x_{i-1}]\right)^2\right]$$


$$= \sum_{i=1}^{n} E_{x_1^i}\!\left[\left(E_{x_{i+1}^n}[Z \mid x_1, \ldots, x_i] - E_{x_i^n}[Z \mid x_1, \ldots, x_{i-1}]\right)^2\right]$$
$$= \sum_{i=1}^{n} E_{x_1^i}\!\left[\left(E_{x_{i+1}^n}[Z \mid x_1, \ldots, x_i] - E_{x_{i+1}^n}\!\left[E_{x_i}[Z \mid x_1, \ldots, x_{i-1}]\right]\right)^2\right]$$

However, $x^2$ is a convex function, and hence we can apply Jensen's inequality (6) and get

$$Var(Z) \leq \sum_{i=1}^{n} E_{x_1^i, x_{i+1}^n}\!\left[\left(Z - E_{x_i}[Z \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n]\right)^2\right]$$

Therefore,

$$Var(Z) \leq \sum_{i=1}^{n} E[(Z - E_i[Z])^2]$$

where $E_i[Z] = E[Z \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n]$. Now let $x$ and $y$ be two independent samples from the same distribution. Then

$$E(x - y)^2 = E[x^2 + y^2 - 2xy] = 2E[x^2] - 2(E[x])^2$$

Hence, if $x$ and $y$ are i.i.d., then $Var\{x\} = E[\frac{1}{2}(x - y)^2]$. Thus we have

$$E_i[(Z - E_i[Z])^2] = \frac{1}{2}E_i[(Z - Z'_i)^2]$$

Thus we have the Efron-Stein inequality,

$$Var(Z) \leq \frac{1}{2}\sum_{i=1}^{n} E[(Z - Z'_i)^2] \qquad (20)$$

Notice that if the function $S$ is the sum of the random variables, the inequality becomes an equality. Hence the bound is tight. It is often referred to as the jackknife bound.
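The following Monte Carlo sketch illustrates the bound (20) for the (permutation-invariant) maximum of $n$ i.i.d. Uniform$(0,1)$ variables; the choice of statistic and the sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n, trials = 10, 20_000

def S(sample):
    return np.max(sample, axis=-1)      # a permutation-invariant statistic (the maximum)

x = rng.random((trials, n))             # i.i.d. Uniform(0,1) samples
x_prime = rng.random((trials, n))       # an independent copy, used for the Z'_i

Z = S(x)
efron_stein = 0.0
for i in range(n):
    x_i = x.copy()
    x_i[:, i] = x_prime[:, i]           # replace only the i-th coordinate
    efron_stein += 0.5 * np.mean((Z - S(x_i)) ** 2)

print(f"Var(Z) (Monte Carlo)    = {np.var(Z):.5f}")
print(f"Efron-Stein upper bound = {efron_stein:.5f}")
```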

5.2 McDiarmid’s Inequality

Theorem 8 Let $S : X^n \to \mathbb{R}$ be a measurable function which is invariant under permutations, and let the random variable $Z$ be given by $Z = S(x_1, x_2, \ldots, x_n)$. Then for any $a > 0$ we have

$$P(|Z - E[Z]| \geq a) \leq 2e^{-\frac{2a^2}{\sum_{i=1}^{n}\zeta_i^2}}$$

whenever the function has bounded differences [10], that is,

$$\sup_{x_1, \ldots, x_n, x'_i} |S(x_1, x_2, \ldots, x_n) - S(x_1, \ldots, x'_i, \ldots, x_n)| \leq \zeta_i$$

where $Z'_i = S(x_1, \ldots, x'_i, \ldots, x_n)$ and $\{x'_1, \ldots, x'_n\}$ is a sample from the same distribution as $\{x_1, \ldots, x_n\}$.

Proof Using Chernoff's bound (3) we get

$$P(Z - E[Z] \geq a) \leq e^{-sa}\, E\!\left[e^{s(Z - E[Z])}\right]$$

Now let

$$V_i = E[Z \mid x_1, \ldots, x_i] - E[Z \mid x_1, \ldots, x_{i-1}], \quad \forall i = 1, \ldots, n$$

Then $V = \sum_{i=1}^{n} V_i = Z - E[Z]$. Therefore,

$$P(Z - E[Z] \geq a) \leq e^{-sa}\, E\!\left[e^{\sum_{i=1}^{n} sV_i}\right] = e^{-sa}\prod_{i=1}^{n} E[e^{sV_i}] \qquad (21)$$

Now let $V_i$ be bounded by the interval $[L_i, U_i]$. We know that $|Z - Z'_i| \leq \zeta_i$; hence it follows that $|V_i| \leq \zeta_i$ and hence $|U_i - L_i| \leq \zeta_i$. Using (13) on $E[e^{sV_i}]$ we get

$$E[e^{sV_i}] \leq e^{\frac{s^2(U_i - L_i)^2}{8}} \leq e^{\frac{s^2\zeta_i^2}{8}}$$

Using this in (21) we get

$$P(Z - E[Z] \geq a) \leq e^{-as}\prod_{i=1}^{n} e^{\frac{s^2\zeta_i^2}{8}} = e^{\frac{s^2\sum_{i=1}^{n}\zeta_i^2}{8} - sa}$$

Now, to make the bound tight, we simply minimize it with respect to $s$. Therefore,

$$\frac{2s\sum_{i=1}^{n}\zeta_i^2}{8} - a = 0 \;\Rightarrow\; s = \frac{4a}{\sum_{i=1}^{n}\zeta_i^2}$$

Therefore the bound is given by

$$P(Z - E[Z] \geq a) \leq e^{\left(\frac{4a}{\sum_{i=1}^{n}\zeta_i^2}\right)^2\frac{\sum_{i=1}^{n}\zeta_i^2}{8} - \frac{4a^2}{\sum_{i=1}^{n}\zeta_i^2}}$$
$$\Rightarrow P(Z - E[Z] \geq a) \leq e^{-\frac{2a^2}{\sum_{i=1}^{n}\zeta_i^2}}$$

Hence we get

$$P(|Z - E[Z]| \geq a) \leq 2e^{-\frac{2a^2}{\sum_{i=1}^{n}\zeta_i^2}} \qquad (22)$$
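As an illustration of (22), the sketch below takes $S$ to be the empirical mean of $n$ i.i.d. Uniform$(0,1)$ variables, for which the bounded-difference constants are $\zeta_i = 1/n$; the numerical choices are otherwise arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n, a, trials = 100, 0.1, 50_000

def S(sample):
    return sample.mean(axis=-1)         # the empirical mean: changing one x_i in [0,1]
                                        # changes S by at most zeta_i = 1/n

x = rng.random((trials, n))             # i.i.d. Uniform(0,1), so E[Z] = 0.5 exactly
Z = S(x)
empirical = np.mean(np.abs(Z - 0.5) >= a)

zeta = np.full(n, 1.0 / n)
mcdiarmid = 2 * np.exp(-2 * a**2 / np.sum(zeta**2))   # bound (22)

print(f"empirical P(|Z - E[Z]| >= {a}) = {empirical:.5f}")
print(f"McDiarmid bound (22)           = {mcdiarmid:.5f}")
```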

5.3 Logarithmic Sobolev Inequality

This inequality is very useful in deriving simple proofs of many known bounds, using a method known as Ledoux's method [11]. Before jumping into the theorem, let us first prove a useful lemma.

Lemma 9 For any positive random variable $y$ and any $\alpha > 0$ we have

$$E\{y\log y\} - Ey\log Ey \leq E\{y\log y - y\log\alpha - (y - \alpha)\} \qquad (23)$$

Proof For any $x > 0$ we have $\log x \leq x - 1$. Therefore,

$$\log\frac{\alpha}{Ey} \leq \frac{\alpha}{Ey} - 1$$

Therefore,

$$Ey\log\frac{\alpha}{Ey} \leq \alpha - Ey$$

Adding $E\{y\log y\}$ to both sides we get

$$Ey\log\alpha - Ey\log Ey + E\{y\log y\} \leq \alpha - Ey + E\{y\log y\}$$

Therefore, simplifying, we get the required result,

$$E\{y\log y\} - Ey\log Ey \leq E\{y\log y - y\log\alpha - (y - \alpha)\}$$

Now, just like in the previous two sections, let $S : X^n \to \mathbb{R}$ be a measurable function which is invariant under permutations, and let the random variable $Z$ be given by $Z = S(x_1, x_2, \ldots, x_n)$. Here we assume independence of $(x_1, x_2, \ldots, x_n)$. Let $Z'_i = S(x_1, \ldots, x'_i, \ldots, x_n)$, where $\{x'_1, \ldots, x'_n\}$ is another sample from the same distribution as that of $\{x_1, \ldots, x_n\}$. Now we are ready to state and prove the logarithmic Sobolev inequality.


Theorem 10 If the function $\psi(x) = e^x - x - 1$, then

$$sE\{Ze^{sZ}\} - E\{e^{sZ}\}\log E\{e^{sZ}\} \leq \sum_{i=1}^{n} E\{e^{sZ}\psi(-s(Z - Z'_i))\} \qquad (24)$$

Proof From Lemma 9 (Equation (23)), applied under the conditional expectation $E_i$, we have for any positive variable $Y$ and $\alpha = Y'_i > 0$,

$$E_i\{Y\log Y\} - E_iY\log E_iY \leq E_i\{Y\log Y - Y\log Y'_i - (Y - Y'_i)\}$$

Now let $Y = e^{sZ}$ and $Y'_i = e^{sZ'_i}$; then

$$E_i\{Y\log Y\} - E_iY\log E_iY \leq E_i\{e^{sZ}(sZ - sZ'_i) - e^{sZ}(1 - e^{sZ'_i - sZ})\}$$

Writing this in terms of the function $\psi(x)$ we get

$$E_i\{Y\log Y\} - E_iY\log E_iY \leq E_i\{e^{sZ}\psi(-s(Z - Z'_i))\} \qquad (25)$$

Now let the measure $P$ denote the distribution of $(x_1, \ldots, x_n)$ and let the distribution $Q$ be given by

$$dQ(x_1, \ldots, x_n) = dP(x_1, \ldots, x_n)\,Y(x_1, \ldots, x_n)$$

Then we have

$$D(Q||P) = E_Q\{\log Y\} = E_P\{Y\log Y\} = E\{Y\log Y\}$$

Now, since $Y$ is positive, we have

$$D(Q||P) \geq E\{Y\log Y\} - EY\log EY \qquad (26)$$

However, by Han's inequality for relative entropy, rearranging Equation (8) we have

$$D(Q||P) \leq \sum_{i=1}^{n}\left[D(Q||P) - D(Q^{(i)}||P^{(i)})\right] \qquad (27)$$

Now, we have already shown that $D(Q||P) = E\{Y\log Y\}$, and further $E\{E_i\{Y\log Y\}\} = E\{Y\log Y\}$; therefore $D(Q||P) = E\{E_i\{Y\log Y\}\}$. Now, by definition,

$$\frac{dQ^{(i)}(x^{(i)})}{dx^{(i)}} = \int_X \frac{dQ(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n)}{dx_1^n}\, dx_i = \int_X Y(x_1, \ldots, x_n)\frac{dP(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n)}{dx_1^n}\, dx_i$$

Due to the independence assumption on the sample, we can rewrite the above as

$$\frac{dQ^{(i)}(x^{(i)})}{dx^{(i)}} = \frac{dP(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)}{dx^{(i)}}\int_X Y(x_1, \ldots, x_n)\, dP_i(x_i) = \frac{dP(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)}{dx^{(i)}}\, E_i Y$$

Therefore $D(Q^{(i)}||P^{(i)})$ is given by

$$D(Q^{(i)}||P^{(i)}) = \int_{X^{n-1}} E_iY\log E_iY\, dP(x^{(i)}) = E\{E_iY\log E_iY\}$$

Now, using this, we get

$$\sum_{i=1}^{n}\left[D(Q||P) - D(Q^{(i)}||P^{(i)})\right] = \sum_{i=1}^{n}\left[E\{E_i\{Y\log Y\}\} - E\{E_iY\log E_iY\}\right]$$

Therefore, using this in Equation (27), we get

$$D(Q||P) \leq \sum_{i=1}^{n}\left[E\{E_i\{Y\log Y\}\} - E\{E_iY\log E_iY\}\right]$$

and using this in turn with Equation (26) we get

$$E\{Y\log Y\} - EY\log EY \leq \sum_{i=1}^{n} E\{E_i\{Y\log Y\} - E_iY\log E_iY\}$$

Using the above with Equation (25) and substituting $Y = e^{sZ}$, we get the required result,

$$sE\{Ze^{sZ}\} - E\{e^{sZ}\}\log E\{e^{sZ}\} \leq \sum_{i=1}^{n} E\{e^{sZ}\psi(-s(Z - Z'_i))\}$$
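Since both sides of (24) are finite sums for discrete variables, the inequality can be checked by direct enumeration. The sketch below assumes, purely for illustration, $Z = x_1 + \cdots + x_n$ with i.i.d. Rademacher inputs and $s = 0.7$.

```python
import numpy as np
from itertools import product

def psi(x):
    return np.exp(x) - x - 1

s = 0.7
vals, probs = np.array([-1.0, 1.0]), np.array([0.5, 0.5])   # i.i.d. Rademacher inputs
n = 3
S = lambda xs: sum(xs)                                      # a permutation-invariant Z

# Enumerate the full joint distribution of (x_1,...,x_n).
states = list(product(range(2), repeat=n))
p_state = lambda st: np.prod(probs[list(st)])

EZ_esZ = sum(p_state(st) * S(vals[list(st)]) * np.exp(s * S(vals[list(st)])) for st in states)
E_esZ = sum(p_state(st) * np.exp(s * S(vals[list(st)])) for st in states)
lhs = s * EZ_esZ - E_esZ * np.log(E_esZ)

rhs = 0.0
for st in states:
    for i in range(n):
        for xi_new in range(2):                             # independent copy x'_i
            st_new = list(st); st_new[i] = xi_new
            Z, Zp = S(vals[list(st)]), S(vals[st_new])
            rhs += p_state(st) * probs[xi_new] * np.exp(s * Z) * psi(-s * (Z - Zp))

print(f"lhs = {lhs:.6f}  <=  rhs = {rhs:.6f}")
assert lhs <= rhs + 1e-12
```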

6 Symmetrization Lemma

The symmetrization lemma is probably one of the easier bounds we review in these notes. However, it is extremely powerful, since it allows us to bound the difference between the empirical mean of a function and its expected value using the difference between the empirical means of the function over two independent samples of the same size as the original sample. Note that in most of the literature, the symmetrization lemma is stated and proved only for zero-one functions, like a loss function or the classification function itself. Here we derive a more general version, where we prove the lemma for bounding the difference between the expectation of any measurable function with bounded variance and its empirical mean.

Lemma 11 Let $f : X \to \mathbb{R}$ be a measurable function such that $Var\{f\} \leq C$. Let $E_n\{f\}$ be the empirical mean of the function $f(x)$ estimated using a set $(x_1, x_2, \ldots, x_n)$ of $n$ independent, identically distributed samples from the space $X$. Then for any $a > 0$, if $n > \frac{8C}{a^2}$, we have

$$P\!\left(|E'_n\{f\} - E''_n\{f\}| \geq \tfrac{1}{2}a\right) \geq \frac{1}{2}P\!\left(|E\{f\} - E_n\{f\}| > a\right) \qquad (28)$$

where $E'_n\{f\}$ and $E''_n\{f\}$ stand for the empirical means of the function $f(x)$ estimated using the samples $(x'_1, x'_2, \ldots, x'_n)$ and $(x''_1, x''_2, \ldots, x''_n)$ respectively.

Proof By the definition of probability,

$$P\!\left(|E'_n\{f\} - E''_n\{f\}| \geq \tfrac{1}{2}a\right) = \int_{X^{2n}} \mathbf{1}_{\left[|E'_n\{f\} - E''_n\{f\}| - \frac{1}{2}a\right]}\, dP$$

where the function $\mathbf{1}_{[z]}$ is $1$ for any $z \geq 0$ and $0$ otherwise. Since $X^{2n}$ is the product space $X^n \times X^n$, using Fubini's theorem we have

$$P\!\left(|E'_n\{f\} - E''_n\{f\}| \geq \tfrac{1}{2}a\right) = \int_{X^n}\int_{X^n} \mathbf{1}_{\left[|E'_n\{f\} - E''_n\{f\}| - \frac{1}{2}a\right]}\, dP''\, dP'$$

Now, since the set $Y = \{(x_1, x_2, \ldots, x_n) : |E\{f\} - E_n\{f\}| > a\}$ is a subset of $X^n$ and the term inside the integral is always non-negative,

$$P\!\left(|E'_n\{f\} - E''_n\{f\}| \geq \tfrac{1}{2}a\right) \geq \int_{Y}\int_{X^n} \mathbf{1}_{\left[|E'_n\{f\} - E''_n\{f\}| - \frac{1}{2}a\right]}\, dP''\, dP'$$

Now let $Z = \{(x_1, x_2, \ldots, x_n) : |E_n\{f\} - E\{f(x)\}| \leq \frac{a}{2}\}$. Clearly, for any sample $(x_1, x_2, \ldots, x_{2n})$, if $(x_1, x_2, \ldots, x_n) \in Y$ and $(x_{n+1}, x_{n+2}, \ldots, x_{2n}) \in Z$, then $(x_1, x_2, \ldots, x_{2n})$ is a sample such that $|E'_n\{f\} - E''_n\{f\}| \geq \frac{a}{2}$. Therefore, coming back to the integral, since $Z \subset X^n$,

$$\int_{Y}\int_{X^n} \mathbf{1}_{\left[|E'_n\{f\} - E''_n\{f\}| - \frac{1}{2}a\right]}\, dP''\, dP' \geq \int_{Y}\int_{Z} \mathbf{1}_{\left[|E'_n\{f\} - E''_n\{f\}| - \frac{1}{2}a\right]}\, dP''\, dP'$$


Now, since the integral is over the sets $Y$ and $Z$, on which, as we saw above, the indicator is identically $1$,

$$\int_{Y}\int_{Z} \mathbf{1}_{\left[|E'_n\{f\} - E''_n\{f\}| - \frac{1}{2}a\right]}\, dP''\, dP' = \int_{Y}\int_{Z} 1\, dP''\, dP' = \int_{Y} P\!\left(|E''_n\{f\} - E\{f\}| \leq \tfrac{1}{2}a\right) dP'$$

Therefore,

$$\int_{Y}\int_{X^n} \mathbf{1}_{\left[|E'_n\{f\} - E''_n\{f\}| - \frac{1}{2}a\right]}\, dP''\, dP' \geq \int_{Y} P\!\left(|E''_n\{f\} - E\{f\}| \leq \tfrac{1}{2}a\right) dP'$$

Now,

$$P\!\left(|E''_n\{f\} - E\{f\}| \leq \tfrac{1}{2}a\right) = 1 - P\!\left(|E''_n\{f\} - E\{f\}| > \tfrac{1}{2}a\right)$$

Using Equation (2) (often called Chebychev's inequality) we get

$$P\!\left(|E''_n\{f\} - E\{f\}| > \tfrac{1}{2}a\right) \leq \frac{4\,Var\{f\}}{na^2} \leq \frac{4C}{na^2}$$

Now, if we choose $n$ such that $n > \frac{8C}{a^2}$, as per our assumption, then

$$P\!\left(|E''_n\{f\} - E\{f\}| > \tfrac{1}{2}a\right) \leq \frac{1}{2}$$

Therefore,

$$P\!\left(|E''_n\{f\} - E\{f\}| \leq \tfrac{1}{2}a\right) \geq 1 - \frac{1}{2} = \frac{1}{2}$$

Putting this back into the integral we get

$$P\!\left(|E'_n\{f\} - E''_n\{f\}| \geq \tfrac{1}{2}a\right) \geq \int_{Y}\frac{1}{2}\, dP' = \frac{1}{2}\int_{Y} dP'$$

Therefore we get the final result,

$$P\!\left(|E'_n\{f\} - E''_n\{f\}| \geq \tfrac{1}{2}a\right) \geq \frac{1}{2}P\!\left(|E\{f\} - E_n\{f\}| > a\right)$$

Note that if we make the function $f$ a zero-one function, then the maximum possible variance is $\frac{1}{4}$. Hence, if we set $C$ to $\frac{1}{4}$, the condition under which the inequality holds becomes $n > \frac{2}{a^2}$. Further, note that if we choose the zero-one function $f(x)$ such that it is $1$ when, say, $x$ takes a particular value and $0$ otherwise, then the result basically bounds the absolute difference between the probability of the event of $x$ taking that value and the frequency estimate of $x$ taking that value in a sample of size $n$, using the difference in the frequencies of occurrence of that value in two independent samples of size $n$.
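A Monte Carlo illustration of (28) for such a zero-one function $f$ (an indicator of a Bernoulli event, so that $C = \frac{1}{4}$ can be used) is sketched below; the parameters $p$ and $a$ are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
p, a, C = 0.4, 0.2, 0.25            # f is a zero-one function, so Var{f} <= C = 1/4
n = int(8 * C / a**2) + 1           # the lemma requires n > 8C/a^2
trials = 200_000

x  = rng.binomial(1, p, size=(trials, n))   # original sample     -> E_n{f}
x1 = rng.binomial(1, p, size=(trials, n))   # independent sample  -> E'_n{f}
x2 = rng.binomial(1, p, size=(trials, n))   # independent sample  -> E''_n{f}

lhs = np.mean(np.abs(x1.mean(axis=1) - x2.mean(axis=1)) >= a / 2)
rhs = 0.5 * np.mean(np.abs(p - x.mean(axis=1)) > a)

print(f"n = {n}")
print(f"P(|E'_n - E''_n| >= a/2) = {lhs:.5f}")
print(f"(1/2) P(|E - E_n| > a)   = {rhs:.5f}")
assert lhs >= rhs - 0.01            # Monte Carlo check of (28), up to sampling error
```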


References

[1] V. N. Vapnik and A. Ya. Chervonenkis, "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities", Theory of Probability and its Applications, 16(2):264-281, 1971.

[2] Gabor Lugosi, "Concentration-of-Measure Inequalities", Lecture Notes, http://www.econ.upf.es/~lugosi/anu.pdf

[3] A. N. Kolmogorov, "Foundations of Probability", Chelsea Publications, NY, 1956.

[4] Richard L. Wheeden and Antoni Zygmund, "Measure and Integral: An Introduction to Real Analysis", International Series of Monographs in Pure and Applied Mathematics.

[5] W. Hoeffding, "Probability Inequalities for Sums of Bounded Random Variables", Journal of the American Statistical Association, 58:13-30, 1963.

[6] G. Bennett, "Probability Inequalities for Sum of Independent Random Variables", Journal of the American Statistical Association, 57:33-45, 1962.

[7] S. N. Bernstein, "The Theory of Probabilities", Gastehizdat Publishing House, Moscow, 1946.

[8] V. V. Petrov, "Sums of Independent Random Variables", translated by A. A. Brown, Springer Verlag, 1975.

[9] B. Efron and C. Stein, "The Jackknife Estimate of Variance", Annals of Statistics, 9:586-596, 1981.

[10] C. McDiarmid, "On the Method of Bounded Differences", in Surveys in Combinatorics, LMS Lecture Note Series 141, 1989.

[11] Michel Ledoux, "The Concentration of Measure Phenomenon", American Mathematical Society, Mathematical Surveys and Monographs, Vol. 89, 2001.

[12] Olivier Bousquet, Stephane Boucheron, and Gabor Lugosi, "Introduction to Statistical Learning Theory", Lecture Notes, http://www.econ.upf.es/~lugosi/mlss slt.pdf


[13] J. Michael Steele, "An Efron-Stein Inequality for Nonsymmetric Statistics", Annals of Statistics, 14(2):753-758, 1986.