Transcript of f-divergence.pdf


f-DIVERGENCES: SUFFICIENCY, DEFICIENCY AND TESTING OF HYPOTHESES

    Friedrich Liese and Igor Vajda

ABSTRACT. This paper deals with f-divergences of probability measures considered in the same general form as e.g. in [12] or [45], where f is an arbitrary (not necessarily differentiable) convex function. Important particular cases or subclasses are mentioned, including those introduced by Bhattacharyya [3], Kakutani [32], Shannon [61] and Kullback with Leibler [38], Chernoff [7], Kraft [37], Matusita [47], Rényi [57] and De Groot [15]. Some important relations between these subclasses are reproduced or reestablished in a new manner. The main result is a new proof of the representation of the general f-divergence I_f(P_0, P_1) by means of the information gains G_π(P_0, P_1), 0 ≤ π ≤ 1, of De Groot. This proof uses the generalized Taylor formula applied to arbitrary convex functions derived in this paper. The basic known properties of general f-divergences are deduced in a new manner from this representation, among them the convergence for increasing sequences of sub-σ-algebras.


for convex f : (0, ∞) → ℝ, where μ is a σ-finite measure which dominates P_0 and P_1 and the integrand is appropriately specified at the points where the densities dP_0/dμ and dP_1/dμ are zero.

For f(t) = t ln t the f-divergence reduces to the classical K_1(P_0, P_1), which is sometimes denoted by I(P_0, P_1) and called the information divergence or Kullback-Leibler divergence. For the convex or concave functions f(t) = t^s, s > 0, we obtain the so-called Hellinger integrals H_s(P_0, P_1), which are related to the divergences R_s(P_0, P_1) of Rényi [57] by R_s(P_0, P_1) = (s − 1)^{-1} ln H_s(P_0, P_1). Note that the divergence measures ln H_s(P_0, P_1) were considered for 0 < s < 1 already by Chernoff [7] and Kraft [37], and the special case s = 1/2 also by Bhattacharyya [3].
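These quantities are easy to compute for discrete distributions. A minimal numerical sketch (the distributions and the parameter values below are illustrative, not taken from the paper): it evaluates K_1, the Hellinger integrals H_s and the Rényi divergences R_s = (s − 1)^{-1} ln H_s, and shows that R_s approaches K_1 as s → 1.

```python
import numpy as np

# Two illustrative discrete distributions on a common finite support.
p0 = np.array([0.5, 0.3, 0.2])
p1 = np.array([0.4, 0.4, 0.2])

# Kullback-Leibler divergence K_1(P0, P1), i.e. the f-divergence with f(t) = t ln t.
kl = np.sum(p0 * np.log(p0 / p1))

def hellinger_integral(p0, p1, s):
    """Hellinger integral H_s(P0, P1) = sum of p0^s * p1^(1-s)."""
    return np.sum(p0 ** s * p1 ** (1.0 - s))

def renyi_divergence(p0, p1, s):
    """Renyi divergence R_s(P0, P1) = (s - 1)^(-1) * ln H_s(P0, P1)."""
    return np.log(hellinger_integral(p0, p1, s)) / (s - 1.0)

print("K_1      =", kl)
print("H_{1/2}  =", hellinger_integral(p0, p1, 0.5))   # Bhattacharyya coefficient
print("R_2      =", renyi_divergence(p0, p1, 2.0))
print("R_0.999  =", renyi_divergence(p0, p1, 0.999))   # close to K_1, since R_s -> K_1 as s -> 1
```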


When dealing with the sufficiency of a statistic one compares the original model with the model reduced by the statistic. More generally, in decision theory the statistical models are related by means of a less known tool called ε-deficiency and the related so-called concave function criterion. Using the integral representation we show that this criterion is equivalent to a comparison of certain f-divergences. This equivalence characterizes the meaning of f-divergences in statistical decision theory by connecting the dissimilarity of statistical models specified by f-divergences with basic decision-theoretic concepts.

In the last section we apply Hellinger integrals to establish the exponential rate of convergence to zero for testing a simple null hypothesis versus a simple alternative when the sample size tends to infinity. This leads to the results known in the literature as the theorems of Chernoff and Stein. It also allows us to find the exponential rates for a classification problem and to obtain in this manner the results of Krafft and Puri [36] in this area.

2 CONVEX FUNCTIONS

We introduce and study classes of distances in the space of probability distributions which originated from different roots. Some of them were introduced in information theory to describe the amount of information or the amount of uncertainty in a random sample. Others are information functionals obtained by investigating the rate of convergence of error probabilities when the sample size tends to infinity. Still others resulted from the Cramér-Rao inequality and its generalizations. Hellinger integrals are the Laplace transforms of log-likelihood ratios and thus completely describe the structure of binary statistical models. Information functionals were also used


monotonicity follows from (2.2). To prove the right continuity, add ε_n ↓ 0 to a and b in (2.3) and get, for a < b, the relation (b − a)^{-1}[f(b + ε_n) − f(a + ε_n)] ≥ D^+f(a + ε_n). If first n → ∞ and then b ↓ a we get D^+f(a) ≥ lim_{n→∞} D^+f(a + ε_n). Since D^+f is nondecreasing, this gives the right continuity. The left continuity of D^-f is proved similarly. Finally, the inequality (2.2) yields

(2.6)    $D^+f(u) \le \dfrac{f(v) - f(u)}{v - u} \le D^+f(v), \qquad 0 < u < v.$

For h_n = (b − a)/n we get from (2.6)

$$h_n \sum_{i=1}^{n} D^+f(a + i h_n) \;\le\; \sum_{i=1}^{n} \bigl[f(a + (i+1)h_n) - f(a + i h_n)\bigr] \;=\; f(b + h_n) - f(a + h_n),$$

and analogously $\int_a^b D^+f(s)\,ds \ge f(b - h_n) - f(a - h_n)$. The continuity of f completes the proof. □

The statement (2.5) implies that the limits lim_{x↓0} f(x) and lim_{x→∞} f(x) exist. We extend f by setting f(0) = lim_{x↓0} f(x) and f(∞) = lim_{x→∞} f(x), where f(0) may attain the value ∞ and f(∞) may attain the values −∞ or ∞.

As D^+f is continuous from the right there is a uniquely determined σ-finite measure γ_f on the Borel sets of (0, ∞) that satisfies, for every 0 < a < b,

(2.7)    $\gamma_f((a, b]) = D^+f(b) - D^+f(a).$

If f is twice continuously differentiable then D^+f = f' and this function is continuously differentiable so that

$$D^+f(b) - D^+f(a) = \int_a^b f''(t)\,dt \qquad\text{and}\qquad \gamma_f(B) = \int_B f''(t)\,dt.$$

Therefore γ_f can be viewed as a measure of the curvature of f. We use this curvature measure to establish a generalized second order Taylor expansion.

Lemma 2.2. If f : (0, ∞) → ℝ is convex then, for a, b > 0,

(2.8)    $f(b) - f(a) - D^+f(a)(b - a) = \begin{cases} \displaystyle\int (b - t)\, I_{(a,b]}(t)\, \gamma_f(dt) & \text{if } a < b,\\[4pt] \displaystyle\int (t - b)\, I_{(b,a]}(t)\, \gamma_f(dt) & \text{if } b < a. \end{cases}$

Moreover, the function

(2.9)    $f_0(x) = f(x) - f(1) - (x - 1)\,D^+f(1)$

has the representation

(2.10)    $f_0(x) = \begin{cases} \displaystyle\int (x - t \wedge x)\, I_{(1,\infty)}(t)\, \gamma_f(dt) & \text{if } x > 1,\\[4pt] \displaystyle\int (t - t \wedge x)\, I_{(0,1]}(t)\, \gamma_f(dt) & \text{if } 0 < x \le 1. \end{cases}$
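For twice continuously differentiable f the curvature measure has density f'', so the representation (2.8) can be checked by direct numerical quadrature. A minimal sketch, assuming f(t) = t ln t (so D^+f(t) = ln t + 1 and f''(t) = 1/t); the endpoints and the grid size are arbitrary choices.

```python
import numpy as np

f   = lambda t: t * np.log(t)       # a convex function on (0, inf)
df  = lambda t: np.log(t) + 1.0     # D^+ f = f' for this smooth f
d2f = lambda t: 1.0 / t             # density of the curvature measure gamma_f

def curvature_integral(a, b, n=200_000):
    """Right-hand side of (2.8): integral of (b - t) over (a, b] (case a < b),
    or of (t - b) over (b, a] (case b < a), against gamma_f(dt) = f''(t) dt,
    evaluated with a midpoint rule."""
    lo, hi = min(a, b), max(a, b)
    edges = np.linspace(lo, hi, n + 1)
    mid = 0.5 * (edges[:-1] + edges[1:])
    weight = (b - mid) if a < b else (mid - b)
    return np.sum(weight * d2f(mid) * np.diff(edges))

for a, b in [(0.5, 3.0), (3.0, 0.5)]:
    lhs = f(b) - f(a) - df(a) * (b - a)      # left-hand side of (2.8)
    print(lhs, curvature_integral(a, b))     # agree up to quadrature error
```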


Proof. For a < b we have, from (2.5) and the theorem of Fubini,

$$f(b) - f(a) - D^+f(a)(b - a) = \int_a^b \bigl(D^+f(s) - D^+f(a)\bigr)\,ds = \int \Bigl(\int I_{(a,b]}(s)\, I_{(a,s]}(t)\, \gamma_f(dt)\Bigr)\,ds = \int (b - t)\, I_{(a,b]}(t)\, \gamma_f(dt).$$

By interchanging the roles of a and b we get, for a > b,

$$\begin{aligned} f(b) - f(a) - D^+f(a)(b - a) &= -\bigl(f(a) - f(b) - D^+f(b)(a - b)\bigr) + \bigl(D^+f(a) - D^+f(b)\bigr)(a - b)\\ &= -\int (a - t)\, I_{(b,a]}(t)\, \gamma_f(dt) + \int (a - b)\, I_{(b,a]}(t)\, \gamma_f(dt)\\ &= \int (t - b)\, I_{(b,a]}(t)\, \gamma_f(dt). \end{aligned}$$

The statement (2.10) follows from (2.8) as f_0(1) = D^+f_0(1) = 0. □


A convex function f is said to be strictly convex at x_0 ∈ (0, ∞) if for no δ > 0 the function f is linear in (x_0 − δ, x_0 + δ). It is called strictly convex on (0, ∞) if it is strictly convex at every x_0 ∈ (0, ∞). The representation (2.8) shows that f : (0, ∞) → ℝ is strictly convex at x_0 ∈ (0, ∞) if and only if γ_f((x_0 − δ, x_0 + δ)) > 0 for every δ > 0. If f is twice continuously differentiable and f''(x) > 0, 0 < x < ∞, then the function f is strictly convex on (0, ∞).
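Two illustrative cases (added here as examples, not taken from the paper) show how the curvature measure localizes strict convexity. For f(t) = |t − 1| the right derivative jumps from −1 to +1 at t = 1, so γ_f = 2δ_1 and f is strictly convex at x_0 = 1 but at no other point; for f(t) = t ln t one has

$$\gamma_f(dt) = f''(t)\,dt = \frac{dt}{t},$$

which charges every open subinterval of (0, ∞), so f is strictly convex on (0, ∞).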

We inspect the function f_0 in (2.9) in more detail. The representation (2.8) shows that f_0(x) ≥ 0 and

(2.11)    f is strictly convex at 1 ⟺ (A) ∨ (B), where

(A) f_0(x) > 0 for 0 < x < 1,
(B) f_0(x) > 0 for 1 < x < ∞.

For later purposes we define the *-conjugate function by

(2.12)    $f^*(x) = x\, f\!\left(\tfrac{1}{x}\right), \qquad x > 0.$

If f : (0, ∞) → ℝ is convex, then f* : (0, ∞) → ℝ is convex as well. Indeed, for 0 < α < 1 and 0 <


where f(0) ∈ (−∞, ∞]. For later purposes we need also the boundary values f(0) and f*(0) expressed in terms of the curvature measure γ_f as follows.

Lemma 2.3. The following result holds:

(2.14)    $\lim_{x \downarrow 0} f^*(x) = \gamma_f((1, \infty)) + D^+f(1),$

(2.15)    $\lim_{x \downarrow 0} f(x) = \int I_{(0,1]}(t)\, t\, \gamma_f(dt) + f(1) - D^+f(1).$

Proof. It is enough to consider the function f_0 defined in (2.9). The representation (2.10), the monotone convergence theorem and γ_f = γ_{f_0} give

$$\lim_{x \to \infty} \frac{1}{x}\, f_0(x) = \lim_{x \to \infty} \int \frac{1}{x}\, I_{(1,x]}(t)\,(x - t)\, \gamma_{f_0}(dt) = \gamma_f((1, \infty)),$$

$$\lim_{x \downarrow 0} f_0(x) = \lim_{x \downarrow 0} \int I_{(x,1]}(t)\,(t - x)\, \gamma_{f_0}(dt) = \int I_{(0,1]}(t)\, t\, \gamma_f(dt). \qquad \Box$$

3 f-DIVERGENCES AND RELATED DISTANCES

Now we introduce a general class of information functionals. Let P_0, P_1 be probability measures defined on (X, 𝔄) which are dominated by the σ-finite measure μ and denote by p_0 and p_1 the respective μ-densities, i.e. let

$$p_0 = \frac{dP_0}{d\mu} \qquad\text{and}\qquad p_1 = \frac{dP_1}{d\mu}.$$

Definition 2. For every convex function f : (0, ∞) → ℝ the functional

(3.1)    $I_f(P_0, P_1) := \displaystyle\int f(p_0/p_1)\, p_1\, I_{\{p_0 > 0,\, p_1 > 0\}}\, d\mu + f(0)\, P_1(p_0 = 0) + f^*(0)\, P_0(p_1 = 0)$

is called the f-divergence of P_0 with respect to P_1.
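A direct transcription of Definition 2 for finite sample spaces (a minimal sketch; the distributions and the function below are illustrative only). For f(t) = t ln t one has f(0) = 0 and f*(0) = lim_{t→∞} f(t)/t = ∞, so the divergence is infinite exactly when P_0 puts mass where P_1 does not.

```python
import numpy as np

def f_divergence(p0, p1, f, f_at_0, fstar_at_0):
    """I_f(P0, P1) from (3.1) for discrete distributions given as arrays of point
    masses: the integral over {p0 > 0, p1 > 0} plus the two boundary terms
    f(0) * P1(p0 = 0) and f*(0) * P0(p1 = 0)."""
    p0, p1 = np.asarray(p0, float), np.asarray(p1, float)
    both = (p0 > 0) & (p1 > 0)
    total = np.sum(f(p0[both] / p1[both]) * p1[both])
    mass1 = p1[p0 == 0].sum()       # P1(p0 = 0)
    mass0 = p0[p1 == 0].sum()       # P0(p1 = 0)
    if mass1 > 0:
        total += f_at_0 * mass1
    if mass0 > 0:
        total += fstar_at_0 * mass0
    return total

f = lambda t: t * np.log(t)

# P0 is dominated by P1, so the result is the ordinary (finite) Kullback-Leibler divergence.
print(f_divergence([0.2, 0.5, 0.3, 0.0], [0.1, 0.6, 0.2, 0.1], f, f_at_0=0.0, fstar_at_0=np.inf))
# P0 has mass where p1 = 0: the boundary term f*(0) * P0(p1 = 0) makes I_f infinite.
print(f_divergence([0.2, 0.5, 0.3], [0.4, 0.6, 0.0], f, f_at_0=0.0, fstar_at_0=np.inf))
```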

The right hand term is well defined because f(0), f*(0) > −∞ and the inequality (2.3) implies,

$$f(p_0/p_1)\, p_1 \ \ge\ f(1)\, p_1 + \bigl(D^+f(1)\bigr)(p_0 - p_1) \qquad\text{on } \{p_0 > 0,\ p_1 > 0\}.$$

As the right hand function is integrable we see that the integral in (3.1) is well defined but may take on the value +∞. Note that P_1(p_0 = 0) and P_0(p_1 = 0) are the weights of the singular parts of P_1 and P_0 with respect to P_0 and P_1, respectively. They are independent of the special choice of the dominating measure μ. This follows also for the integral in (3.1) by the chain rule of measure theory. Therefore the definition of I_f(P_0, P_1) is independent of the special choice of μ.

The functional I_f(P_0, P_1) − f(1) depends only on the nonlinear part of f in the following sense.

Proposition 3.1. If g(x) = f(x) + ax + b then

(3.2)    $I_f(P_0, P_1) - f(1) = I_g(P_0, P_1) - g(1),$

and, for g = f_0 from (2.9), we have I_f(P_0, P_1) − f(1) = I_{f_0}(P_0, P_1).



Proof. By the definition in (3.1),

$$\begin{aligned} I_g(P_0, P_1) &:= \int g(p_0/p_1)\, p_1\, I_{\{p_0>0,\,p_1>0\}}\, d\mu + g(0)\, P_1(p_0 = 0) + g^*(0)\, P_0(p_1 = 0)\\ &= \int f(p_0/p_1)\, p_1\, I_{\{p_0>0,\,p_1>0\}}\, d\mu + \int a\,(p_0/p_1)\, p_1\, I_{\{p_0>0,\,p_1>0\}}\, d\mu + \int b\, p_1\, I_{\{p_0>0,\,p_1>0\}}\, d\mu\\ &\qquad + (f(0) + b)\, P_1(p_0 = 0) + (f^*(0) + a)\, P_0(p_1 = 0)\\ &= I_f(P_0, P_1) + a + b, \end{aligned}$$

so that I_f(P_0, P_1) − f(1) = I_g(P_0, P_1) − g(1). □


Although the functional I_f(P_0, P_1) does not satisfy the axioms of a metric for general f, it has several properties that allow us to interpret this functional as a "distance measure".

Proposition 3.2. For every convex function f we have I_f(P_0, P_1) − f(1) ≥ 0, with equality for P_0 = P_1. If f is strictly convex at x_0 = 1 then I_f(P_0, P_1) − f(1) = 0 implies P_0 = P_1. Moreover, the functional I_{f*} is conjugate to I_f in the sense that,

(3.3)    $I_{f^*}(P_0, P_1) = I_f(P_1, P_0).$

Proof. The function f_0 is nonnegative so that the expression I_{f_0}(P_0, P_1) = I_f(P_0, P_1) − f(1) is nonnegative as well. If P_0 = P_1 then p_0 = p_1 μ-a.e. and P_1(p_0 = 0) = P_0(p_1 = 0) = 0 so that the integral on the right hand side of (3.1) has the value f(1). Assume now that f is strictly convex at x_0 = 1. In view of Proposition 3.1 it is sufficient to consider f_0. By (2.11), f_0(x) > 0 for every x > 1 or f_0(x) > 0 for every 0 < x < 1. Assume the first condition holds; then f_0(x) ≥ 0 and I_f(P_0, P_1) − f(1) = 0 together with (3.1) for f = f_0 show that μ(p_0 > p_1) = 0. This implies,

and, therefore, μ(p_1 > p_0) = 0. Hence μ(p_1 ≠ p_0) = 0 and P_1 = P_0. The case when f_0(x) > 0 for every 0 < x < 1 is similar. The statement (3.3) is an immediate consequence of (2.12) and (3.1). □

I_f(P_0, P_1) − f(1) does not satisfy the triangular inequality and is not symmetric in (P_0, P_1), in general. From Definition 2 it follows that the symmetry in (P_0, P_1) holds if f(x) = f*(x) := x f(1/x). To keep the notation simple we will use the symbol I_f(P_0, P_1) also if f is concave. The next display presents special parametrized classes of functions which are either convex

    or concave and provide well-known information functionals.
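Before turning to those classes, Propositions 3.1 and 3.2 can be checked numerically on strictly positive discrete distributions. A minimal sketch, assuming f(t) = t ln t (so that f*(t) = −ln t) and purely illustrative data:

```python
import numpy as np

p0 = np.array([0.2, 0.5, 0.3])
p1 = np.array([0.1, 0.6, 0.3])

def I(p0, p1, f):
    # f-divergence for strictly positive discrete distributions (no boundary terms needed)
    return np.sum(f(p0 / p1) * p1)

f     = lambda t: t * np.log(t)           # f(1) = 0
fstar = lambda t: -np.log(t)              # f*(t) = t f(1/t)
g     = lambda t: f(t) + 2.0 * t - 3.0    # g = f + at + b with a = 2, b = -3

# (3.2): I_f - f(1) equals I_g - g(1) for any affine perturbation of f.
print(I(p0, p1, f) - f(1.0), I(p0, p1, g) - g(1.0))
# (3.3): I_{f*}(P0, P1) equals I_f(P1, P0).
print(I(p0, p1, fstar), I(p1, p0, f))
```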


are called Hellinger integrals and are mainly used in the literature for 0 < s < 1. For some purposes their ... The symmetry I_f(P_0, P_1) = I_f(P_1, P_0) holds for f = f*. If this equality is not fulfilled we could turn to the symmetrized function f + f*. The main problem is the triangular inequality, which can be verified only for special f. Examples are the Hellinger distance and the variational distance. The next example contains another class of divergences that satisfy the triangular inequality.

Example 6. The functional

$$A_\alpha(P_0, P_1) = \ldots, \qquad 0 < \alpha \le 1,$$

was introduced ...

$$\ldots\, dP_0 + \infty \cdot P_0(p_1 = 0),$$

(3.13)    $2K_2(P_0, P_1) = H_2(P_0, P_1) - 1 = \chi^2(P_0, P_1).$

Since k_s(x) ≥ 0, by construction we get 0 ≤ K_s(P_0, P_1) ≤ ∞. As mentioned above, K_1(P_0, P_1) is sometimes called the Kullback-Leibler divergence. We illustrate by examples that Hellinger integrals, and consequently the divergences K_s(P_0, P_1), can be explicitly evaluated for a large variety of distributions that are important in statistics. We consider the class of exponential families, which plays a central role in mathematical statistics. Let (X, 𝔄) be a given measurable space and T : X → ℝ^d be a statistic. For any σ-finite measure μ we put

$$\Delta = \Bigl\{\theta: \int \exp\{\langle\theta, T\rangle\}\, d\mu < \infty\Bigr\} \subseteq \mathbb{R}^d,$$


is a probability measure on (X, 𝔄), and the family of distributions (P_θ)_{θ∈Δ} is called an exponential family with generating statistic T and natural parameter space Δ.

Example 7. Assume that P_θ, θ ∈ Δ ⊆ ℝ^d, is an exponential family with natural parameter θ ∈ Δ and generating statistic T : X → ℝ^d. The μ-density of P_θ is then given by p_θ = exp{⟨T, θ⟩ −


Similarly, for −∞ < s < ∞,

(3.19)

for the family of Poisson distributions {Po(λ) : λ > 0}.
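For the Poisson family the Hellinger integrals can be evaluated in closed form; a direct summation of the densities gives H_s(Po(λ_0), Po(λ_1)) = exp{λ_0^s λ_1^{1−s} − sλ_0 − (1−s)λ_1}, which is the kind of explicit expression (3.19) refers to. A minimal numerical check (the parameter values are arbitrary, and truncating the infinite sum at k = 200 is an assumption that is harmless for these λ):

```python
import numpy as np
from scipy.stats import poisson

lam0, lam1, s = 2.5, 4.0, 0.3

# Closed form: H_s = exp(lam0^s * lam1^(1-s) - s*lam0 - (1-s)*lam1).
closed = np.exp(lam0 ** s * lam1 ** (1 - s) - s * lam0 - (1 - s) * lam1)

# Direct evaluation of H_s = sum_k p0(k)^s * p1(k)^(1-s), truncated at k = 200.
k = np.arange(0, 201)
direct = np.sum(poisson.pmf(k, lam0) ** s * poisson.pmf(k, lam1) ** (1 - s))

print(closed, direct)                 # the two values agree
print(np.log(closed) / (s - 1.0))     # the corresponding Renyi divergence R_s
```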

To give a statistical interpretation of the f-divergences G_π(P_0, P_1), introduced for 0 ≤ π ≤ 1 in (3.4), we consider the problem of testing the simple null hypothesis H_0 : P_0 versus the alternative H_1 : P_1. A statistical test φ is then a measurable mapping φ : X → [0, 1], where the value φ(x) represents the conditional probability of rejecting H_0 when the observation is x. Consequently, ∫φ dP_0 is the probability of rejecting H_0 when H_0 is true, called the error probability of the first kind. Similarly, ∫(1 − φ) dP_1 is the probability of rejecting H_1 when H_1 is true, called the error probability of the second kind. The mixture

$$\pi \int \varphi\, dP_0 + (1 - \pi) \int (1 - \varphi)\, dP_1$$

of the error probabilities, taken for the prior probability 0 ≤ π ≤ 1 of the hypothesis H_0, is the Bayes' error probability or the Bayes' risk. Each test which minimizes the Bayes' error probability is a Bayes' test. Next we present a well known result on the Bayes' test and the minimal Bayes' error probability.

Lemma 3.5. In the binary model (X, 𝔄, {P_0, P_1}) the test φ_π : X → [0, 1] defined by

(3.20)    $\varphi_\pi = \begin{cases} 1 & \text{if } c\, p_0 < p_1,\\ q & \text{if } c\, p_0 = p_1,\\ 0 & \text{if } c\, p_0 > p_1, \end{cases} \qquad q \in [0, 1] \text{ arbitrary},$

for c = π/(1 − π) is a Bayes test and the minimal Bayes' error probability

$$b_\pi(P_0, P_1) = \pi \int \varphi_\pi\, dP_0 + (1 - \pi) \int (1 - \varphi_\pi)\, dP_1$$

is given by,

(3.21)    $b_\pi(P_0, P_1) = \int \bigl(\pi p_0 \wedge (1 - \pi) p_1\bigr)\, d\mu,$

where μ is a σ-finite dominating measure and p_i = dP_i/dμ.

Proof. We have,

$$\pi \int \varphi\, dP_0 + (1 - \pi) \int (1 - \varphi)\, dP_1 = (1 - \pi) + \int \varphi\, \bigl(\pi p_0 - (1 - \pi) p_1\bigr)\, d\mu,$$

and let


We see from (3.21) that the minimal Bayes' error probability is related to the divergence G_π(P_0, P_1) defined by means of g_π in (3.4) as follows:

(3.22)    $G_\pi(P_0, P_1) = I_{g_\pi}(P_0, P_1) = \pi \wedge (1 - \pi) - b_\pi(P_0, P_1).$

The functional G_π(P_0, P_1) admits the following interpretation proposed by De Groot [15, 16]: the first term π ∧ (1 − π) is the minimal Bayes' error probability that can be achieved before the observation in the model {P_0, P_1} is made, and the second term b_π(P_0, P_1) is the minimal Bayes' error probability achievable after this observation is made. The non-negative difference G_π(P_0, P_1) between these two errors thus represents an information gain achieved by taking the observation.

We will show in the next theorem that I_f(P_0, P_1) − f(1) is a superposition of the information gains G_π(P_0, P_1) with respect to a curvature measure ρ_f on (0, 1) defined by,

(3.23)    $\rho_f(B) = \int (1 + t)\, I_B\!\left(\tfrac{1}{1+t}\right) \gamma_f(dt).$

The measures ρ_f and γ_f satisfy, for every measurable h : (0, 1) → [0, ∞), the relation,

(3.24)    $\int h(\pi)\, \rho_f(d\pi) = \int (1 + t)\, h\!\left(\tfrac{1}{1+t}\right) \gamma_f(dt).$

We denote by,

(3.25)    $S_{\rho_f} = \{x \in (0, 1): \rho_f((x - \varepsilon, x + \varepsilon)) > 0 \text{ for all } \varepsilon > 0\}$

the support of the measure ρ_f.

Theorem 3.6. For every convex function f : (0, ∞) → ℝ and arbitrary distributions P_0, P_1 we have,

(3.26)    $I_f(P_0, P_1) - f(1) = \int_{(0,1)} G_\pi(P_0, P_1)\, \rho_f(d\pi).$

Corollary 3.7. The following holds,

$$K_s(P_0, P_1) = \int_0^1 \frac{G_\pi(P_0, P_1)}{\pi^{1+s}(1 - \pi)^{2-s}}\, d\pi, \qquad -\infty < s < \infty,$$

$$I_f(P_0, P_1) - f(1) = \int \bigl(t \wedge 1 - (1 + t)\, b_{1/(1+t)}(P_0, P_1)\bigr)\, I_{(0,\infty)}(t)\, \gamma_f(dt).$$
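The representation (3.26) can be checked numerically. For f(t) = t ln t the curvature measure is γ_f(dt) = dt/t, and (3.24) turns it into ρ_f(dπ) = dπ/(π²(1 − π)), i.e. the case s = 1 above. A minimal sketch with illustrative discrete distributions; the quadrature grid is an arbitrary choice.

```python
import numpy as np

p0 = np.array([0.2, 0.5, 0.3])
p1 = np.array([0.1, 0.6, 0.3])

# Left-hand side of (3.26) for f(t) = t ln t (f(1) = 0): the Kullback-Leibler divergence.
kl = np.sum(p0 * np.log(p0 / p1))

# Right-hand side: integral of the De Groot gain G_pi = min(pi, 1-pi) - b_pi against
# rho_f(dpi) = dpi / (pi^2 (1 - pi)), evaluated with a midpoint rule on (0, 1).
edges = np.linspace(0.0, 1.0, 100_001)
mid = 0.5 * (edges[:-1] + edges[1:])
b = np.minimum(mid[:, None] * p0[None, :], (1.0 - mid)[:, None] * p1[None, :]).sum(axis=1)
gain = np.minimum(mid, 1.0 - mid) - b          # (3.22), with the minimal Bayes errors b_pi from (3.21)
rhs = np.sum(gain / (mid ** 2 * (1.0 - mid)) * np.diff(edges))

print(kl, rhs)   # the two values agree up to quadrature error
```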


There are many approaches to reduction of a large sample X_1, ..., X_n. One of them is to use a partition p = {A_1, ..., A_m} of the sample space X and to replace the observations by the relative frequencies of these observations in the partition cells. Here, and in the sequel, a partition p means a collection {A_1, ..., A_m} of subsets of X such that,

(3.32)    $A_i \in \mathfrak{A}, \quad A_i \cap A_j = \emptyset \ \text{for}\ i \ne j, \quad\text{and}\quad A_1 \cup \cdots \cup A_m = X.$

Instead of the original sample space (X, 𝔄) we now use the sample space (X, σ(p)), where σ(p) is the algebra generated by the partition p. Assume now that we have an increasing sequence of partitions p_n so that the sequence of σ-algebras 𝔄_n generates 𝔄; then we can approximate 𝔄-measurable tests by 𝔄_n-measurable tests. We, therefore, achieve the minimal Bayes' risk approximately, registering the cells visited by observations instead of the observations themselves, provided n is large enough. Denote by P_i^{𝔄_n} the restriction of P_i to the sub-σ-algebra 𝔄_n.

Lemma 3.10. If 𝔄_1 ⊆ 𝔄_2 ⊆ ... is a nondecreasing sequence of sub-σ-algebras of 𝔄 which generates 𝔄 then G_π(P_0^{𝔄_n}, P_1^{𝔄_n}) ↑ G_π(P_0, P_1).

Proof. The monotonicity follows from (3.27). Set P̄ = ½(P_0 + P_1) and consider the densities p_i = dP_i/dP̄, i = 0, 1, as random variables on (X, 𝔄, P̄). The conditional expectation p_{i,n} = E_{P̄}(p_i | 𝔄_n) with respect to P̄ satisfies, for every A ∈ 𝔄_n,

$$\int_A E_{\bar P}(p_i \mid \mathfrak{A}_n)\, d\bar P = \int_A p_i\, d\bar P = P_i^{\mathfrak{A}_n}(A),$$

which implies $p_{i,n} = dP_i^{\mathfrak{A}_n} / d\bar P^{\mathfrak{A}_n}$. Hence, by the martingale convergence theorem of Lévy, see [33], $E_{\bar P}|p_i - p_{i,n}| \to 0$ as n → ∞. Using the elementary inequality |a ∧ b − c ∧ d| ≤ |a − c| + |b − d|,
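Independently of the martingale argument, the monotonicity in Lemma 3.10 is easy to illustrate numerically: restricting to a partition amounts to replacing the distributions by their cell probabilities, and refining the partition can only increase the De Groot gain. A minimal sketch with an illustrative eight-point sample space and π = 1/2 (the distributions and partitions are made up for the illustration):

```python
import numpy as np

# Illustrative distributions on an eight-point sample space.
p0 = np.array([0.05, 0.20, 0.15, 0.10, 0.20, 0.15, 0.10, 0.05])
p1 = np.array([0.20, 0.15, 0.10, 0.05, 0.05, 0.10, 0.15, 0.20])

def gain(p0, p1, pi=0.5):
    """De Groot information gain G_pi = min(pi, 1-pi) - sum of min(pi*p0, (1-pi)*p1)."""
    return min(pi, 1.0 - pi) - np.sum(np.minimum(pi * p0, (1.0 - pi) * p1))

def restrict(p, cells):
    """Distribution induced on the cells of a partition (each cell is a list of indices)."""
    return np.array([p[list(c)].sum() for c in cells])

coarse = [[0, 1, 2, 3], [4, 5, 6, 7]]        # a two-cell partition
fine   = [[0, 1], [2, 3], [4, 5], [6, 7]]    # a refinement of it
full   = [[i] for i in range(8)]             # the finest partition (full observation)

for cells in (coarse, fine, full):
    print(gain(restrict(p0, cells), restrict(p1, cells)))
# The printed gains are nondecreasing along the refinements, as in Lemma 3.10.
```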


where the last inequality follows from (3.28). To show that here the equality can be achieved by taking the supremum we set 𝔅 = σ(p_0, p_1). If P_i^𝔅 and P̄^𝔅 denote the restrictions of P_i and P̄ on 𝔅 then, by the definition of 𝔅, p_i = dP_i^𝔅/dP̄^𝔅. Hence b_π(P_0, P_1) = b_π(P_0^𝔅, P_1^𝔅) by (3.21) and (3.22), so that I_f(P_0^𝔅, P_1^𝔅) = I_f(P_0, P_1) by (3.26). As the open intervals (a, b) with rational endpoints generate the σ-algebra of Borel sets of the real line and the complete inverse images of (a, b) under p_0 and p_1 generate 𝔅, we see that 𝔅 is countably generated. This means that we find a nondecreasing sequence of algebras 𝔄_n that generate 𝔅. If p_n is the system of atoms of 𝔄_n then the p_n form a nondecreasing sequence of partitions with 𝔄_n = σ(p_n). The rest of the proof follows from (3.33). □

4 f-DIVERGENCES, SUFFICIENCY AND ε-DEFICIENCY

If a model M = (X, 𝔄, (P_θ)_{θ∈Δ}) is given and (Y, 𝔅) is a measurable space then a measurable mapping T : X → Y is called a statistic, and the model N = (Y, 𝔅, (Q_θ)_{θ∈Δ}) with Q_θ = P_θ ∘ T^{-1} is said to be reduced by the statistic T. Recall that the statistic T : X → Y is said to be sufficient if for every A ∈ 𝔄 there is a function k_A : Y → ℝ with the property,

(4.1)    $E_\theta(I_A \mid T) = k_A(T), \quad P_\theta\text{-a.s.}, \quad \theta \in \Delta.$

The statistic T is called pairwise sufficient if it is sufficient for each binary model (X, 𝔄, {P_{θ_1}, P_{θ_2}}), θ_1, θ_2 ∈ Δ. It is well known, see e.g. [63], that for dominated models a statistic T is sufficient if and only if T is pairwise sufficient.

The independence of the conditional probability from the parameter extends easily to the independence of the conditional expectation of any random variable. Suppose that S : X → ℝ is a random variable with E_θ|S| < ∞, θ ∈ Δ. If T is sufficient, then there is some measurable function k_S : Y → ℝ such that,

(4.2)    $E_\theta(S \mid T) = k_S(T), \quad P_\theta\text{-a.s.}, \quad \theta \in \Delta.$
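As an illustration of how sufficiency interacts with the divergences of Section 3 (a standard textbook example, not taken from the paper): for n i.i.d. Bernoulli observations the sample sum T is sufficient, and reducing the binary model by T leaves the Kullback-Leibler divergence unchanged. The sketch compares the divergence of the product model, which by additivity is n times the per-observation divergence, with the divergence of the induced binomial model; the parameter values are arbitrary.

```python
import numpy as np
from scipy.stats import binom

n, a, b = 10, 0.3, 0.5   # sample size and the two Bernoulli success probabilities

def kl_bernoulli(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b)."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

# Divergence of the full product model: KL is additive over independent coordinates.
kl_product = n * kl_bernoulli(a, b)

# Divergence of the model reduced by the sufficient statistic T = sum of the
# observations, i.e. between Binomial(n, a) and Binomial(n, b).
k = np.arange(n + 1)
pmf0, pmf1 = binom.pmf(k, n, a), binom.pmf(k, n, b)
kl_reduced = np.sum(pmf0 * np.log(pmf0 / pmf1))

print(kl_product, kl_reduced)   # equal: the sufficient statistic loses no divergence
```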

The independence of the conditional probabilities from the parameter was historically the starting point of the concept of sufficiency. This concept can be traced back to Fisher [11], who considered a statistic T to be sufficient if the conditional distribution of any other statistic S given T is independent of the parameter, so that T contains the complete information. This means that if the value T = t is observed then the knowledge of the value x leading to the observation t contains no additional information about the parameter. As pairwise sufficiency and sufficiency are equivalent for dominated models, in the sequel we deal only with binary models. Let M = (X, 𝔄, {P_0, P_1}) be a binary model, (Y, 𝔅) a measurable space and T : X → Y a statistic; then N = (Y, 𝔅, {Q_0, Q_1}), Q_i = P_i ∘ T^{-1}, is the reduced model. We set,

(4.3)    $L_i := \frac{dP_i}{d\bar P} \qquad\text{and}\qquad \bar Q = \tfrac{1}{2}(Q_0 + Q_1), \quad M_i := \frac{dQ_i}{d\bar Q}, \qquad i = 0, 1.$

The first consequence of the sufficiency of a statistic T is that the hypotheses testing problems

(4.4)    H_0 : P_0 versus H_1 : P_1   and   H_0 : Q_0 versus H_1 : Q_1

are equivalent in the sense that for each test in one of these problems achieving certain error probabilities of the first and second kind there is a test in the other problem achieving the
