Information-Theoretic Limits for Inference, Learning, and Optimization
Part 1: Fano's Inequality for Statistical Estimation
Jonathan Scarlett
Croucher Summer Course in Information Theory (CSCIT), July 2019
Information Theory
• How do we quantify "information" in data?
• Information theory [Shannon, 1948]:
  – Fundamental limits of data communication:
    Source (U1, …, Up) → Encoder → Codeword (X1, …, Xn) → Channel → Output (Y1, …, Yn) → Decoder → Reconstruction (Û1, …, Ûp)
  – Information of source: Entropy
  – Information learned at channel output: Mutual information
• Principles:
  – First fundamental limits without complexity constraints, then practical methods
  – First asymptotic analyses, then convergence rates, finite-length, etc.
  – Mathematically tractable probabilistic models
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett, Slide 1/53
Information Theory and Data
• Conventional view: Information theory is a theory of communication
• Emerging view: Information theory is a theory of data
  [Diagrams: Information Theory in relation to Data Generation, Inference & Learning, Optimization, and Storage & Transmission]
• Extracting information from channel output vs. extracting information from data
Examples
• Information theory in machine learning and statistics:
  – Statistical estimation [Le Cam, 1973]
  – Group testing [Malyutov, 1978]
  – Multi-armed bandits [Lai and Robbins, 1985]
  – Phylogeny [Mossel, 2004]
  – Sparse recovery [Wainwright, 2009]
  – Graphical model selection [Santhanam and Wainwright, 2012]
  – Convex optimization [Agarwal et al., 2012]
  – DNA sequencing [Motahari et al., 2012]
  – Sparse PCA [Birnbaum et al., 2013]
  – Community detection [Abbe, 2014]
  – Matrix completion [Riegler et al., 2015]
  – Ranking [Shah and Wainwright, 2015]
  – Adaptive data analysis [Russo and Zou, 2015]
  – Supervised learning [Nokleby, 2016]
  – Crowdsourcing [Lahouti and Hassibi, 2016]
  – Distributed computation [Lee et al., 2018]
  – Bayesian optimization [Scarlett, 2018]
• Note: More than just using entropy / mutual information...
Analogies
Same concepts, different terminology:

    Communication Problems        | Data Problems
    ------------------------------|--------------------------------------
    Channels with feedback        | Active learning / adaptivity
    Rate-distortion theory        | Approximate recovery
    Joint source-channel coding   | Non-uniform prior
    Error probability             | Error probability
    Random coding                 | Random sampling
    Side information              | Side information
    Channels with memory          | Statistically dependent measurements
    Mismatched decoding           | Model mismatch
    ...                           | ...
Cautionary Notes
Some cautionary notes on the information-theoretic viewpoint:
  – The simple models we can analyze may be over-simplified (more so than in communication)
  – Compared to communication, we often can't get matching achievability/converse (often settle for correct scaling laws)
  – Information-theoretic limits are not (yet) considered much in practice (to my knowledge)... but they do guide algorithm design
  – We often encounter gaps between information-theoretic limits and computational limits
  – Often, information theory simply isn't the right tool for the job
Terminology: Achievability and Converse
Achievability result (example): Given n(ε) data samples, there exists an algorithm achieving an "error" of at most ε
  – Estimation error: ‖θ̂ − θtrue‖ ≤ ε
  – Optimization error: f(x_selected) ≤ min_x f(x) + ε
Converse result (example): In order to achieve an "error" of at most ε, any algorithm requires at least n(ε) data samples
Converse Bounds for Statistical Estimation via Fano's Inequality
(Based on the survey chapter https://arxiv.org/abs/1901.00555)
Statistical Estimation
  [Diagrams: (i) Index V → Select parameter θV → Algorithm (samples Y, inputs X) → Infer → Estimate V̂; (ii) Parameter θ → Algorithm (samples Y, inputs X) → Estimate θ̂]
• General statistical estimation setup:
  – Unknown parameter θ ∈ Θ
  – Samples Y = (Y1, …, Yn) drawn from Pθ(y)
  – More generally, from Pθ,X with inputs X = (X1, …, Xn)
  – Given Y (and possibly X), construct estimate θ̂
• Goal. Minimize some loss ℓ(θ, θ̂)
  – 0-1 loss: ℓ(θ, θ̂) = 1{θ̂ ≠ θ}
  – Squared ℓ2 loss: ℓ(θ, θ̂) = ‖θ̂ − θ‖₂²
• Typical example. Linear regression
  – Estimate θ ∈ R^p from Y = Xθ + Z
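As a concrete instance of the setup above, here is a minimal least-squares sketch of the linear-regression example; the dimensions, noise level, and use of NumPy's `lstsq` are illustrative assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5                      # number of samples, parameter dimension
theta_true = rng.standard_normal(p)

X = rng.standard_normal((n, p))    # non-adaptive inputs
Z = 0.1 * rng.standard_normal(n)   # observation noise
Y = X @ theta_true + Z             # linear model Y = X theta + Z

# Least-squares estimate of theta from (X, Y)
theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Squared l2 loss of the estimate
loss = np.sum((theta_hat - theta_true) ** 2)
print(loss)
```

With n much larger than p, the loss scales roughly like (noise variance) × p / n, which is the kind of achievability statement the converse bounds in this part are matched against.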
Defining Features
• There are many properties that impact the analysis:
  – Discrete θ (e.g., graph learning, sparsity pattern recovery)
  – Continuous θ (e.g., regression, density estimation)
  – Bayesian θ (average-case performance)
  – Minimax bounds over Θ (worst-case performance)
  – Non-adaptive inputs (all X1, …, Xn chosen in advance)
  – Adaptive inputs (Xi can be chosen based on Y1, …, Yi−1)
• This talk. Minimax bounds, mostly non-adaptive, first discrete and then continuous
High-Level Steps
Steps in attaining a minimax lower bound (converse):
1. Reduce the estimation problem to multiple hypothesis testing
2. Apply a form of Fano's inequality
3. Bound the resulting mutual information term
(Multiple hypothesis testing: Given samples Y1, …, Yn, determine which distribution among P1(y), …, PM(y) generated them. M = 2 gives binary hypothesis testing.)
Step I: Reduction to Multiple Hypothesis Testing
• Lower bound the worst-case error by the average over a hard subset {θ1, …, θM}:
  [Diagram: Index V → Select parameter θV → Samples Y, X → Infer → Estimate V̂]
• Idea:
  – Show that a "successful" algorithm returning θ̂ =⇒ correct estimation of V (When is this true?)
  – Equivalent statement: If V can't be estimated reliably, then θ̂ can't be successful.
Step I: Example
• Example: Suppose an algorithm is claimed to return θ̂ such that ‖θ̂ − θ‖₂ ≤ ε
• If θ1, …, θM are separated by 2ε, then from θ̂ we can identify the correct V ∈ {1, …, M}
• Note: Tension between the number of hypotheses, the difficulty in distinguishing them, and sufficient separation. Choosing a suitable set {θ1, …, θM} can be very challenging.
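A toy numerical version of this separation argument, with nearest-neighbour decoding of V from θ̂; the specific point set and value of ε are illustrative.

```python
import numpy as np

eps = 0.5
# Hard subset: points on a line with pairwise distance exactly 2*eps = 1
thetas = np.array([[0.0], [1.0], [2.0], [3.0]])

V_true = 2
theta_true = thetas[V_true]

# Any estimate within eps of the true parameter...
theta_hat = theta_true + 0.3          # ||theta_hat - theta_true|| <= eps

# ...identifies V via nearest-neighbour decoding, since all other
# hypotheses are at distance >= 2*eps - eps = eps away
V_hat = int(np.argmin(np.linalg.norm(thetas - theta_hat, axis=1)))
print(V_hat)
```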
Step II: Application of Fano's Inequality
• Standard form of Fano's inequality for M-ary hypothesis testing and uniform V:

      P[V̂ ≠ V] ≥ 1 − (I(V; V̂) + log 2) / log M

  – Intuition: Need the learned information I(V; V̂) to be close to the prior uncertainty log M
• Variations:
  – Non-uniform V
  – Approximate recovery
  – Conditional version
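The standard form above is straightforward to evaluate numerically; a small helper (working in nats) showing how the lower bound strengthens as M grows or as I(V; V̂) shrinks. The example (M, mi) pairs are arbitrary.

```python
import math

def fano_lower_bound(mi, M):
    """Fano: P[Vhat != V] >= 1 - (I(V; Vhat) + log 2) / log M  (in nats)."""
    return 1.0 - (mi + math.log(2)) / math.log(M)

# More hypotheses (larger M) or less information (smaller mi) => larger bound
for M, mi in [(16, 1.0), (1024, 1.0), (1024, 5.0)]:
    print(M, mi, max(0.0, fano_lower_bound(mi, M)))
```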
Step III: Bounding the Mutual Information
• The key quantity remaining after applying Fano's inequality is I(V; V̂)
• Data processing inequality: (Based on V → Y → V̂ or similar)
  – No inputs: I(V; V̂) ≤ I(V; Y)
  – Non-adaptive inputs: I(V; V̂|X) ≤ I(V; Y|X)
  – Adaptive inputs: I(V; V̂) ≤ I(V; X, Y)
• Tensorization: (Based on conditional independence of the samples)
  – No inputs: I(V; Y) ≤ Σ_{i=1}^n I(V; Yi)
  – Non-adaptive inputs: I(V; Y|X) ≤ Σ_{i=1}^n I(V; Yi|Xi)
  – Adaptive inputs: I(V; X, Y) ≤ Σ_{i=1}^n I(V; Yi|Xi)
• KL divergence bounds:
  – I(V; Y) ≤ max_{v,v′} D(P_{Y|V}(·|v) ‖ P_{Y|V}(·|v′))
  – I(V; Y) ≤ max_v D(P_{Y|V}(·|v) ‖ Q_Y) for any Q_Y
  – If each P_{Y|V}(·|v) is ε-close to the closest of Q1(y), …, QN(y) in KL divergence, then I(V; Y) ≤ log N + ε
  – (Similarly with conditioning on X)
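The tensorization step can be checked numerically in a toy model: V uniform on {0, 1}, observed through two conditionally independent symmetric channels (the crossover probability ρ is an illustrative choice).

```python
import numpy as np
from itertools import product

def mi_from_joint(p_vy):
    """Mutual information (nats) from a joint pmf array p[v, y]."""
    p_v = p_vy.sum(axis=1)
    p_y = p_vy.sum(axis=0)
    mi = 0.0
    for v in range(p_vy.shape[0]):
        for y in range(p_vy.shape[1]):
            if p_vy[v, y] > 0:
                mi += p_vy[v, y] * np.log(p_vy[v, y] / (p_v[v] * p_y[y]))
    return mi

rho = 0.1                          # crossover probability
# Single sample: V uniform on {0,1}, Yi = V flipped with probability rho
p1 = np.zeros((2, 2))
for v, y in product(range(2), range(2)):
    p1[v, y] = 0.5 * (1 - rho if y == v else rho)

# Two conditionally independent samples: Y = (Y1, Y2), flattened to 4 outcomes
p2 = np.zeros((2, 4))
for v, y1, y2 in product(range(2), range(2), range(2)):
    p2[v, 2 * y1 + y2] = 0.5 * ((1 - rho if y1 == v else rho) *
                                (1 - rho if y2 == v else rho))

I_n = mi_from_joint(p2)            # I(V; Y1, Y2)
I_sum = 2 * mi_from_joint(p1)      # sum over i of I(V; Yi)
print(I_n, I_sum)                  # tensorization: I_n <= I_sum
```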
Discrete Example 1: Group Testing

Group Testing
  [Diagram: Items (contaminated/clean) × Tests → Outcomes]
• Goal: Given the test matrix X and outcomes Y, recover the item vector β
• Sample complexity: Required number of tests n
Information Theory and Group Testing
• Information-theoretic viewpoint:
  – S: Defective set;  X_S: Columns of X indexed by S
  [Diagram: Message S → Encoder → Codeword X_S → Channel, Y ∼ P^n_{Y|X_S} → Decoder → Estimate Ŝ]
Information Theory and Group Testing
• Example formulation of a general result:

      Sample complexity:  n ≳ H(S) / I(P_{Y|X_S})

  – H(S): entropy (model uncertainty)
  – I(P_{Y|X_S}): mutual information (information learned from measurements)
Converse via Fano's Inequality
• Reduction to multiple hypothesis testing: Trivial! Set V = S.
• Application of Fano's inequality:

      P[Ŝ ≠ S] ≥ 1 − (I(S; Ŝ|X) + log 2) / log (p choose k)

• Bounding the mutual information:
  – Data processing inequality: I(S; Ŝ|X) ≤ I(U; Y), where U are the pre-noise outputs
  – Tensorization: I(U; Y) ≤ Σ_{i=1}^n I(Ui; Yi)
  – Capacity bound: I(Ui; Yi) ≤ C if the outcome is passed through a channel of capacity C
• Final result:

      n ≤ (1 − ε) · log (p choose k) / C   =⇒   P[Ŝ ≠ S] ↛ 0
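For the noiseless model, each binary test outcome carries at most one bit, i.e. C = log 2, and the resulting counting bound is easy to evaluate; a small sketch (example values of p and k are arbitrary).

```python
import math

def counting_bound(p, k, C=math.log(2)):
    """Fano/counting converse: at least log(p choose k) / C tests are needed."""
    return math.log(math.comb(p, k)) / C

# With C = log 2 this is log2(p choose k), roughly k*log2(p/k) for k << p
print(counting_bound(1000, 10))
```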
Noiseless Group Testing (Exact Recovery)
  [Figure: achievable rates vs. sparsity parameter; curves: Information-Theoretic, Counting Bound, DD, COMP, Bernoulli, Near-Const.]
• Other implications:
  – Adaptivity and approximate recovery:
    – No gain at low sparsity levels
    – Significant gain at high sparsity levels
  – Efficient algorithm optimal (w.r.t. random test design) at high sparsity levels
  – Analogous results in noisy settings
Detour: An Information-Theoretic Achievability Example

A Practical Approach: Separate Decoding of Items
• Searching over (p choose k) subsets is infeasible
• Practical alternative: Separate decoding of items
  [Diagram: Items × Tests → Outcomes, decoded item-by-item]
• Channel coding interpretation:
  [Diagram: Encoder j → Channel j → Decoder j for each item j = 1, …, p, with inputs Xj and outputs Y]
• Claim: Asymptotic optimality to within a factor of log 2 ≈ 0.7 (sometimes)
Example Results (Symmetric Noise)
• Summary of results in the symmetric noise setting: [Scarlett/Cevher, 2018]
  – Joint: Intractable joint decoding rule (otherwise separate)
  – False pos/neg: Small number of false positives/negatives allowed in reconstruction
  [Figure: asymptotic ratio of k log₂(p/k) to n vs. θ, where k = Θ(p^θ); curves: Joint false pos/neg, Joint exact, False pos/neg, False neg, False pos, Exact, Best previous practical]
• Note: Within a factor log 2 ≈ 0.7 of the information-theoretic optimum at low sparsity
SDI Achievability Proof 1 (Notation)
• Define the defectivity indicator random variable

      βj = 1{j ∈ S}

• Separate decoding of items: Decode βj based only on Xj and Y
  – Xj : j-th column of X
• Binary hypothesis testing based rule:

      β̂j(Xj, Y) = 1{ Σ_{i=1}^n log [ P_{Y|Xj,βj}(Y^(i) | Xj^(i), 1) / P_Y(Y^(i)) ] > γ }

  – X^(i) : i-th row of X
  – Y^(i) : i-th test outcome
  – γ : Threshold to be set later
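The threshold rule above can be exercised in the noiseless special case: any negative test containing item j drives the information density to −∞, so for any finite γ the rule reduces to declaring j defective iff no negative test contains it (the COMP rule). A minimal simulation under an assumed Bernoulli test design; the problem sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, n = 500, 5, 120                  # items, defectives, tests (illustrative)
nu = np.log(2) / k                     # Bernoulli design: include each item w.p. (log 2)/k

S = rng.choice(p, size=k, replace=False)
beta = np.zeros(p, dtype=bool)
beta[S] = True

X = rng.random((n, p)) < nu            # random test matrix
Y = X[:, beta].any(axis=1)             # noiseless OR channel

# Noiseless separate decoding: item j is declared defective iff it
# appears in no negative test (negative tests give -infinite LLR terms)
beta_hat = ~((X & ~Y[:, None]).any(axis=0))

errors = np.count_nonzero(beta_hat != beta)
print(errors)
```

By construction this rule makes no false negatives; the only errors are non-defective items that happen never to appear in a negative test, which becomes rare as n grows.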
SDI Achievability Proof 2 (Non-Asymptotic Bound)
Non-asymptotic bound: Under Bernoulli testing, we have

      Pe ≤ k · P[ı₁ⁿ(X₁, Y) ≤ γ] + (p − k) e^{−γ},

where

      ı₁ⁿ(Xj, Y) := Σ_{i=1}^n ı₁(Xj^(i), Y^(i)),    ı₁(x₁, y) := log [ P_{Y|X₁}(y|x₁) / P_Y(y) ]

is the information density.

Proof idea:
  – First term: Union bound over k false negative events
  – Second term: Union bound over p − k false positive events & change of measure
SDI Achievability Proof 3 (Asymptotic Analysis)
Simplification: If we have concentration of the form

      P[ı₁ⁿ(X₁, Y) ≤ n I(X₁; Y)(1 − δ₂)] ≤ ψ_n(δ₂),

then the following conditions suffice for Pe → 0:

      n ≥ log((p − k)/δ₁) / [I(X₁; Y)(1 − δ₂)],        k · ψ_n(δ₂) → 0

Notes:
  – Proved by setting γ = log((p − k)/δ₁) in the initial result
  – With false positives allowed, can instead use γ = log((p − k)/(k δ₁))
  – With false negatives allowed, it suffices that ψ_n(δ₂) → 0
SDI Achievability Proof 4 (Concentration Bounds)
Bernstein's inequality:

      P[ı₁ⁿ(X₁, Y) ≤ n I₁(1 − δ₂)] ≤ exp( − (1/2) · (n/k) · c²_mean δ₂² / (c_var + (1/3) c_mean c_max δ₂) )

for suitable constants c_mean, c_var, c_max.

Refined concentration for the noiseless model:

      P[ı₁ⁿ(X₁, Y) ≤ n I₁(1 − δ₂)] ≤ exp( − (n log 2 / k) · ((1 − δ₂) log(1 − δ₂) + δ₂) · (1 + o(1)) )
SDI Achievability Proof 5 (Final Conditions)
Final noiseless result: We have Pe → 0 if

      n ≥ min_{δ₂>0} max{ k log p / [(log 2)²(1 − δ₂)],  k log k / [(log 2)((1 − δ₂) log(1 − δ₂) + δ₂)] } · (1 + η)

Final noisy result: We have Pe → 0 if

      n ≥ min_{δ₂>0} max{ k log p / [(log 2)(log 2 − H₂(ρ))(1 − δ₂)],  (k log k)(c_var + (1/3) c_mean c_max δ₂) / [(1/2) c²_mean δ₂²] } · (1 + η)

Approximate recovery: We have Pe^(approx) → 0 if

      n ≥ [log(p/k) / I(X₁; Y)] · (1 + η)
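The noiseless condition can be evaluated numerically by gridding δ₂; a sketch (the grid resolution and the example values of p and k are arbitrary choices, and η is set to 0 here).

```python
import math

def noiseless_sdi_bound(p, k, eta=0.0):
    """Minimize over delta2 the max of the two noiseless sample-size terms."""
    log2 = math.log(2)
    best = float("inf")
    for i in range(1, 1000):
        d2 = i / 1000.0
        t1 = k * math.log(p) / (log2 ** 2 * (1 - d2))
        t2 = k * math.log(k) / (log2 * ((1 - d2) * math.log(1 - d2) + d2))
        best = min(best, max(t1, t2))
    return best * (1 + eta)

print(noiseless_sdi_bound(10**4, 10))
```

The first term dominates for small δ₂ and the second for δ₂ near 1, so the minimizing δ₂ balances the two.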
Example Results (Symmetric Noise)
• Summary of results in the symmetric noise setting: [Scarlett/Cevher, 2018]
  – Joint: Intractable joint decoding rule (otherwise separate)
  – False pos/neg: Small number of false positives/negatives allowed in reconstruction
  [Figure: asymptotic ratio of k log₂(p/k) to n vs. θ, where k = Θ(p^θ); curves: Joint false pos/neg, Joint exact, False pos/neg, False neg, False pos, Exact, Best previous practical]
• Note: Within a factor log 2 ≈ 0.7 of the information-theoretic optimum at low sparsity
Discrete Example 2: Graphical Model Selection

Graphical Model Representations of Joint Distributions
Motivating example:
  – In a population of p people, let

        Yi = { 1 if person i is infected;  −1 if person i is healthy },   i = 1, …, p

  – Example models: [Abbe and Wainwright, ISIT Tutorial, 2015]
  – Joint distribution for a given graph G = (V, E):

        P[(Y₁, …, Yp) = (y₁, …, yp)] = (1/Z) exp( Σ_{(i,j)∈E} λ_ij y_i y_j )
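For intuition, the joint distribution above can be computed exactly for a tiny graph by enumerating all 2^p configurations; the cycle graph, uniform edge strength, and size here are illustrative choices.

```python
import numpy as np
from itertools import product

def ising_pmf(edges, lam, p):
    """Exact pmf of a small Ising model: P(y) = (1/Z) exp(sum_{(i,j) in E} lam*yi*yj)."""
    configs = list(product([-1, 1], repeat=p))
    weights = np.array([np.exp(sum(lam * y[i] * y[j] for i, j in edges))
                        for y in configs])
    Z = weights.sum()                  # partition function
    return configs, weights / Z

# 4-node cycle graph with uniform edge strength lambda = 0.8
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
configs, probs = ising_pmf(edges, lam=0.8, p=4)

# Positive lam favours agreement: the two fully aligned configurations
# (all -1 and all +1) are the modes of the distribution
print(configs[int(np.argmax(probs))])
```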
Graphical Model Selection: Illustration
• A larger example from [Abbe and Wainwright, ISIT Tutorial, 2015]:
  – Example graphs: [images]
  – Sample images (Ising model): [images]
• Goal: Identify the graph given n independent samples
Graphical Model Selection: Definition
General problem statement.
  – Given n i.i.d. samples of (Y₁, …, Yp) ∼ P_G, recover the underlying graph G
  – Applications: Statistical physics, social and biological networks
  – Error probability:

        Pe = max_{G∈𝒢} P[Ĝ ≠ G | G]

Assumptions.
  – Distribution class:
    – Ising model:  P_G(y₁, …, yp) = (1/Z) exp( Σ_{(i,j)∈E} λ_ij y_i y_j )
    – Gaussian model:  (Y₁, …, Yp) ∼ N(μ, Σ), where (Σ⁻¹)_ij ≠ 0 ⟺ (i, j) ∈ E  [Hammersley-Clifford theorem]
  – Graph class:
    – Bounded-edge (at most k edges in total)
    – Bounded-degree (at most d edges out of each node)
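A small numerical illustration of the Gaussian case: edges correspond exactly to nonzeros of the precision matrix Σ⁻¹, even though the covariance itself is dense. The chain graph and the edge weight 0.4 are illustrative choices (small enough to keep the precision matrix positive definite).

```python
import numpy as np

# Gaussian graphical model on a 4-node chain 0-1-2-3: build a precision
# matrix Theta whose nonzero off-diagonals are exactly the chain's edges
p = 4
Theta = np.eye(p)
for i in range(p - 1):
    Theta[i, i + 1] = Theta[i + 1, i] = 0.4

Sigma = np.linalg.inv(Theta)              # covariance of the model

# The covariance is dense (even non-adjacent nodes are correlated), but
# inverting it recovers the graph: (Sigma^{-1})_ij != 0  <=>  (i,j) in E
recovered = np.abs(np.linalg.inv(Sigma)) > 1e-8
edges = [(i, j) for i in range(p) for j in range(i + 1, p) if recovered[i, j]]
print(edges)
```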
Information-Theoretic Viewpoint
• Information-theoretic viewpoint:
  [Diagram: G → Channel P_{Y|G} → Y → Decoder → Ĝ]
Converse via Fano's Inequality
• Reduction to multiple hypothesis testing: Let G be uniform on a hard subset 𝒢₀ ⊆ 𝒢
  – Ideally many graphs (lots of graphs to distinguish)
  – Ideally close together (harder to distinguish)
• Application of Fano's inequality:

      P[Ĝ ≠ G] ≥ 1 − (I(G; Ĝ) + log 2) / log |𝒢₀|

• Bounding the mutual information:
  – Data processing inequality: I(G; Ĝ) ≤ I(G; Y)
  – Tensorization: I(G; Y) ≤ Σ_{i=1}^n I(G; Yi)
  – KL divergence bound: Bound I(G; Yi) ≤ max_G D(P_{Y|G}(·|G) ‖ Q_Y) case-by-case
Graph Ensembles
I Graphs that are difficult to distinguish from the empty graph:
Reveals n = Ω((1/λ²) log p) necessary condition with "edge strength" λ and p nodes
I Graphs that are difficult to distinguish from the complete (sub-)graph:
Reveals n = Ω(e^{λd}) necessary condition with "edge strength" λ and degree d
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 31/ 53
Upper vs. Lower Bounds
• Example results with maximal degree d, edge strength λ (slightly informal):
I (Converse) n = Ω(max{1/λ², e^{λd}} log p) [Santhanam and Wainwright, 2012]
I (Achievability) n = O(max{1/λ², e^{λd} d} log p) [Santhanam and Wainwright, 2012]
I (Early Practical) n = O(d² log p), but with extra assumptions that are hard to certify [Ravikumar/Wainwright/Lafferty, 2010]
I (Recent Practical) n = O((d² e^{λd}/λ²) log p) [Klivans/Meka, 2017]
I ...and more [Lokhov et al., 2018] [Wu/Sanghavi/Dimakis, 2018]
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 32/ 53
Graphical Model Selection with Twists
• Approximate recovery: [S., Cevher, IEEE TSIPN, 2016]
P_e(d_max) = max_{G∈G} P[d_H(Ê, E) > d_max | G]
• Active learning / adaptive sampling: [S., Cevher, AISTATS, 2017]
(Diagram: at each round i, the decoder chooses nodes X^(i) to observe and receives samples Y^(i) from the graph G.)
• Analogies:
Communication Problems | Data Problems
Feedback | Active learning / adaptivity
Rate-distortion theory | Approximate recovery
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 33/ 53
What About Continuous-Valued Estimation?
Statistical Estimation
(Diagram: select index V → parameter θ_V → samples Y (with inputs X) → algorithm → estimate θ̂ → inferred index V̂.)
• General statistical estimation setup:
I Unknown parameter θ ∈ Θ
I Samples Y = (Y_1, ..., Y_n) drawn from P_θ(y)
I More generally, from P_{θ,X} with inputs X = (X_1, ..., X_n)
I Given Y (and possibly X), construct estimate θ̂
• Goal. Minimize some loss ℓ(θ, θ̂)
I 0-1 loss: ℓ(θ, θ̂) = 1{θ̂ ≠ θ}
I Squared ℓ2 loss: ‖θ − θ̂‖²₂
• Typical example. Linear regression
I Estimate θ ∈ ℝ^p from Y = Xθ + Z
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 34/ 53
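The linear regression example above can be made concrete. A 1-D toy instance (illustrative parameters, not from the slides): θ is a scalar, the noise is i.i.d. N(0, σ²), and the least-squares estimate is ⟨X, Y⟩ / ⟨X, X⟩:

```python
import random

# 1-D instance of Y = X*theta + Z with i.i.d. Gaussian noise; the
# parameters (n, theta, sigma) are purely illustrative.
random.seed(0)
n, theta, sigma = 2000, 1.5, 0.5
X = [random.gauss(0.0, 1.0) for _ in range(n)]
Y = [theta * x + random.gauss(0.0, sigma) for x in X]

# Least-squares estimate theta_hat = <X, Y> / <X, X>.
theta_hat = sum(x * y for x, y in zip(X, Y)) / sum(x * x for x in X)
sq_loss = (theta_hat - theta) ** 2  # the squared l2 loss from the slide
```

The squared loss here concentrates around σ²/Σx_i², illustrating why the loss is treated as a random quantity on the next slide.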
Minimax Risk
• Since the samples are random, so is θ̂, and hence so is ℓ(θ, θ̂)
• So we seek to minimize the average loss E_θ[ℓ(θ, θ̂)].
I Note: E_θ and P_θ denote averages w.r.t. Y when the true parameter is θ.
• Minimax risk:
M_n(Θ, ℓ) = inf_{θ̂} sup_{θ∈Θ} E_θ[ℓ(θ, θ̂)],
i.e., the worst-case average loss over all θ ∈ Θ
• Approach: Lower bound the worst-case error by an average over a hard subset {θ_1, ..., θ_M}:
(Diagram: select index V → parameter θ_V → samples Y (with inputs X) → algorithm → estimate θ̂ → inferred index V̂.)
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 35/ 53
General Lower Bound via Fano’s Inequality
• To get a meaningful result, we need a sufficiently "well-behaved" loss function. Subsequently, focus on loss functions of the form
ℓ(θ, θ̂) = Φ(ρ(θ, θ̂)),
where ρ(θ, θ′) is some metric, and Φ(·) is some non-negative and increasing function (e.g., ℓ(θ, θ̂) = ‖θ − θ̂‖²₂)
• Claim. Fix ε > 0, and let {θ_1, ..., θ_M} be a finite subset of Θ such that
ρ(θ_v, θ_{v′}) ≥ ε, ∀v, v′ ∈ {1, ..., M}, v ≠ v′.
Then, we have
M_n(Θ, ℓ) ≥ Φ(ε/2) (1 − (I(V; Y) + log 2) / log M),
where V is uniform on {1, ..., M}, and I(V; Y) is with respect to V → θ_V → Y.
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 36/ 53
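The claim is a plug-in formula once the separation ε, the packing size M, and a bound on I(V; Y) are in hand. A small evaluator (hypothetical numbers in the example; Φ defaults to the squared-distance loss mentioned above):

```python
import math

def minimax_lower_bound(eps, mutual_info, M, Phi=lambda t: t ** 2):
    """Evaluate the claimed bound
        M_n(Theta, l) >= Phi(eps/2) * (1 - (I(V; Y) + log 2) / log M)
    for an eps-separated subset of size M (all quantities in nats)."""
    assert M >= 2 and eps > 0
    frac = 1.0 - (mutual_info + math.log(2)) / math.log(M)
    return Phi(eps / 2.0) * max(frac, 0.0)

# Hypothetical numbers: separation eps = 0.2, M = 1024 hypotheses,
# and 2 nats of mutual information.
risk_lb = minimax_lower_bound(eps=0.2, mutual_info=2.0, M=1024)
```

As with Fano's inequality itself, the bound becomes vacuous (clipped to 0) once I(V; Y) exceeds log M.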
Proof of General Lower Bound
• Using Markov's inequality (for any ε0 > 0):
sup_{θ∈Θ} E_θ[ℓ(θ, θ̂)] ≥ sup_{θ∈Θ} Φ(ε0) P_θ[ℓ(θ, θ̂) ≥ Φ(ε0)] = Φ(ε0) sup_{θ∈Θ} P_θ[ρ(θ, θ̂) ≥ ε0]
• Suppose that V̂ = arg min_{j=1,...,M} ρ(θ_j, θ̂). Then, by the triangle inequality and ρ(θ_v, θ_{v′}) ≥ ε, if ρ(θ_v, θ̂) < ε/2 then we must have V̂ = v:
P_{θ_v}[ρ(θ_v, θ̂) ≥ ε/2] ≥ P_{θ_v}[V̂ ≠ v].
• Hence, setting ε0 = ε/2,
sup_{θ∈Θ} P_θ[ρ(θ, θ̂) ≥ ε/2] ≥ max_{v=1,...,M} P_{θ_v}[ρ(θ_v, θ̂) ≥ ε/2]
≥ max_{v=1,...,M} P_{θ_v}[V̂ ≠ v]
≥ (1/M) Σ_{v=1,...,M} P_{θ_v}[V̂ ≠ v]
≥ 1 − (I(V; Y) + log 2) / log M,
where the final step uses Fano's inequality.
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 37/ 53
Local Approach
• General bound: If ρ(θ_v, θ_{v′}) ≥ ε for all v ≠ v′, then
M_n(Θ, ℓ) ≥ Φ(ε/2) (1 − (I(V; Y) + log 2) / log M)
• Local approach: Carefully-chosen "local" hard subset:
(Figure: the distributions P_v of the ε-separated parameters all lie within a small KL divergence ball around a common centre Q_Y.)
• Resulting bound:
M_n(Θ, ℓ) ≥ Φ(ε/2) (1 − (max_{v=1,...,M} D(P^n_{θ_v} ‖ Q^n_Y) + log 2) / log M).
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 38/ 53
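The local approach is easy to instantiate when the observation distributions are Gaussian with a common variance, since the n-sample KL divergence has a closed form. A sketch (illustrative parameters; the Gaussian model is an assumption made here for concreteness):

```python
import math

def gaussian_kl_n(mu_p, mu_q, sigma, n):
    """D(P^n || Q^n) for n i.i.d. samples of N(mu_p, sigma^2) versus
    N(mu_q, sigma^2): n * (mu_p - mu_q)^2 / (2 * sigma^2)."""
    return n * (mu_p - mu_q) ** 2 / (2.0 * sigma ** 2)

def local_fano_bound(eps, mus, mu_q, sigma, n, Phi=lambda t: t ** 2):
    """Plug the worst-case divergence from the common centre Q_Y into the
    local-approach bound (this divergence upper-bounds I(V; Y))."""
    d_worst = max(gaussian_kl_n(mu, mu_q, sigma, n) for mu in mus)
    frac = 1.0 - (d_worst + math.log(2)) / math.log(len(mus))
    return Phi(eps / 2.0) * max(frac, 0.0)

# Ten means spaced eps = 0.01 apart, all close to the centre mu_q.
mus = [i * 0.01 for i in range(10)]
lb = local_fano_bound(eps=0.01, mus=mus, mu_q=0.045, sigma=1.0, n=10)
```

The point of the "local" construction is visible here: keeping the means close to mu_q keeps d_worst small, so the bracketed factor stays bounded away from zero.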
Global Approach
• General bound: If ρ(θ_v, θ_{v′}) ≥ ε for all v ≠ v′, then
M_n(Θ, ℓ) ≥ Φ(ε/2) (1 − (I(V; Y) + log 2) / log M)
• Global approach: Pack as many ε-separated points as possible:
(Figure: ε-separated parameters packed throughout Θ, with the induced distributions covered by KL divergence balls.)
I Typically suited to infinite-dimensional problems (e.g., non-parametric regression)
• Resulting bound:
M_n(Θ, ℓ) ≥ Φ(ε_p/2) (1 − (log N*_{KL,n}(Θ, ε_{c,n}) + ε_{c,n} + log 2) / log M*_ρ(Θ, ε_p)).
I M*_ρ(Θ, ε_p): number of ε_p-separated θ we can pack into Θ (packing number)
I N*_{KL,n}(Θ, ε_{c,n}): number of ε_{c,n}-size KL divergence balls needed to cover {P_Y} (covering number)
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 39/ 53
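Packing numbers like M*_ρ(Θ, ε_p) can be lower-bounded constructively by a greedy procedure: scan candidate points and keep each one that is ε-far from everything kept so far. A minimal sketch on a toy discrete parameter set (not from the slides):

```python
def greedy_packing(points, eps, dist=lambda a, b: abs(a - b)):
    """Greedily keep points at pairwise distance >= eps; the size of the
    result lower-bounds the packing number M*_rho(Theta, eps)."""
    packing = []
    for p in points:
        if all(dist(p, q) >= eps for q in packing):
            packing.append(p)
    return packing

# Toy example: Theta = {0, 1, ..., 100} with the absolute-difference
# metric; a 10-separated greedy packing picks out 0, 10, ..., 100.
pack = greedy_packing(list(range(101)), 10)
```

The same greedy loop works for any metric `dist`, which is how ε-separated subsets are typically exhibited in converse arguments.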
Continuous Example 1: Sparse Linear Regression
Sparse Linear Regression
• Linear regression model Y = Xθ + Z:
(Diagram: samples Y ∈ ℝ^n = feature matrix X ∈ ℝ^{n×p} times coefficients θ ∈ ℝ^p plus noise Z ∈ ℝ^n.)
I Feature matrix X is given, noise is i.i.d. N(0, σ²)
I Coefficients are sparse – at most k non-zeros
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 40/ 53
Converse via Fano’s Inequality
• Reduction to hypothesis testing: Fix ε > 0 and restrict to sparse vectors of the form
θ = (0, 0, 0, ±ε, 0, ±ε, 0, 0, 0, 0, 0, ±ε, 0)
I Total number of such sequences = 2^k (p choose k) ≈ exp(k log(p/k)) (if k ≪ p)
I Choose a "well separated" subset of size exp((k/4) log(p/k)) (Gilbert–Varshamov)
I Well-separated: Non-zero entries differ in at least k/8 indices
• Application of Fano's inequality:
I Using the general bound given previously:
M_n(Θ, ℓ) ≥ (kε²/32) (1 − (I(V; Y|X) + log 2) / ((k/4) log(p/k)))
• Bounding the mutual information:
I By a direct calculation, I(V; Y|X) ≤ (ε²/(2σ²)) · (k/p) ‖X‖²_F (Gaussian RVs)
I Substitute and choose ε to optimize the bound: ε² = σ² p log(p/k) / (2‖X‖²_F)
• Final result: If ‖X‖²_F ≤ npΓ, then E[‖θ̂ − θ‖²₂] ≤ δ requires n ≥ (cσ²/(Γδ)) · k log(p/k)
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 41/ 53
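The Gilbert–Varshamov step above is existential, but the flavour of the construction can be reproduced by a greedy search over random k-sparse sign vectors, keeping candidates whose coordinates differ sufficiently from everything kept so far. A small stand-in sketch (all parameters here are illustrative, not the slide's k/8 regime):

```python
import random

def hamming_diff(u, v):
    """Number of coordinates in which two sparse vectors (index -> value) differ."""
    return sum(u.get(i) != v.get(i) for i in set(u) | set(v))

def sparse_sign_packing(p, k, eps, min_diff, tries=300, seed=1):
    """Greedily collect k-sparse vectors with entries +-eps whose
    coordinates differ in at least min_diff positions -- a small stand-in
    for the Gilbert-Varshamov subset used in the converse argument."""
    rng = random.Random(seed)
    packing = []
    for _ in range(tries):
        support = rng.sample(range(p), k)
        cand = {i: rng.choice((-eps, eps)) for i in support}
        if all(hamming_diff(cand, v) >= min_diff for v in packing):
            packing.append(cand)
    return packing

pack = sparse_sign_packing(p=50, k=8, eps=0.1, min_diff=2)
```

For these parameters almost every random candidate is accepted, reflecting the slide's point that exponentially many well-separated sparse vectors exist.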
Upper vs. Lower Bounds
• Recap of model: Y = Xθ + Z, where θ is k-sparse
• Lower bound: If ‖X‖²_F ≤ npΓ, achieving E[‖θ̂ − θ‖²₂] ≤ δ requires n ≥ (cσ²/(Γδ)) · k log(p/k)
• Upper bound: If X is a zero-mean random Gaussian matrix with power Γ per entry, then we can achieve E[‖θ̂ − θ‖²₂] ≤ δ using n = (c′σ²/(Γδ)) · k log(p/k) samples
I Maximum-likelihood estimation suffices
• Tighter lower bounds could potentially be obtained under additional restrictions on X
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 42/ 53
Continuous Example 2: Convex Optimization
Stochastic Convex Optimization
• A basic optimization problem:
x* = arg min_{x∈D} f(x)
For simplicity, we focus on the 1D case D ⊆ ℝ (extensions to ℝ^d are possible)
• Model:
I Noisy samples: When we query x, we get a noisy value and a noisy gradient:
Y = f(x) + Z,  Y′ = f′(x) + Z′,
where Z ~ N(0, σ²) and Z′ ~ N(0, σ²)
I Adaptive sampling: Chosen X_i may depend on Y_1, ..., Y_{i−1}
• Function classes: Convex, strongly convex, Lipschitz, self-concordant, etc.
I We will focus on the class of strongly convex functions
I Strong convexity: f(x) − (c/2)x² is a convex function for some c > 0 (we set c = 1)
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 43/ 53
Performance Measure and Minimax Risk
• After sampling n points, the algorithm returns a final point X̂
• The loss incurred is ℓ_f(X̂) = f(X̂) − min_{x∈X} f(x), i.e., the gap to the optimum
• For a given class of functions F, the minimax risk is given by
M_n(F) = inf_{X̂} sup_{f∈F} E_f[ℓ_f(X̂)]
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 44/ 53
Reduction to Multiple Hypothesis Testing
• The picture remains the same:
(Diagram: select index V → function f_V → queried points x and samples y → infer estimate V̂.)
I Successful optimization ⟹ successful identification of V
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 45/ 53
General Minimax Lower Bound
• Claim 1. Fix ε > 0, and let {f_1, ..., f_M} ⊆ F be a subset of F such that for each x ∈ X, we have ℓ_{f_v}(x) ≤ ε for at most one value of v ∈ {1, ..., M}. Then we have
M_n(F) ≥ ε · (1 − (I(V; X, Y) + log 2) / log M),   (1)
where V is uniform on {1, ..., M}, and I(V; X, Y) is w.r.t. V → f_V → (X, Y).
• Claim 2. In the special case M = 2, we have
M_n(F) ≥ ε · H_2^{−1}(log 2 − I(V; X, Y)),   (2)
where H_2^{−1}(·) ∈ [0, 0.5] is the inverse binary entropy function.
• Proof is as with estimation, starting with Markov's inequality:
sup_{f∈F} E_f[ℓ_f(X̂)] ≥ sup_{f∈F} ε · P_f[ℓ_f(X̂) ≥ ε].
• Proof for M = 2 uses a (somewhat less well-known) form of Fano's inequality for binary hypothesis testing
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 46/ 53
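The inverse binary entropy H_2^{−1} in Claim 2 has no closed form, but it is monotone on [0, 1/2] and easy to compute by bisection. A self-contained sketch (working in nats throughout):

```python
import math

def h2(p):
    """Binary entropy in nats."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log(p) - (1 - p) * math.log(1 - p)

def h2_inv(y):
    """Inverse of h2 restricted to [0, 1/2], computed by bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if h2(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def two_point_bound(eps, mutual_info):
    """Claim 2: M_n(F) >= eps * H2^{-1}(log 2 - I(V; X, Y))."""
    return eps * h2_inv(max(math.log(2) - mutual_info, 0.0))

# H2^{-1}((log 2)/2) is roughly 0.11.
val = h2_inv(math.log(2) / 2)
```

Note that the bound degrades gracefully: once I(V; X, Y) reaches log 2, the argument of H_2^{−1} hits 0 and the bound vanishes.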
Strongly Convex Class: Choice of Hard Subset
• Reduction to hypothesis testing. In 1D, it suffices to choose just two similar functions!
I (Becomes 2^{constant·d} functions in d dimensions)
(Figure: two nearby quadratics plotted over the input range [0, 1].)
• The precise functions:
f_v(x) = (1/2)(x − x*_v)², v = 1, 2,
with x*_1 = 1/2 − √(2ε′) and x*_2 = 1/2 + √(2ε′)
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 47/ 53
Analysis
• Application of Fano's inequality. As above,
M_n(F) ≥ ε · H_2^{−1}(log 2 − I(V; X, Y))
I Approach: H_2^{−1}(α) ≥ 1/10 if α ≥ (log 2)/2
I How few samples ensure I(V; X, Y) ≤ (log 2)/2?
• Bounding the mutual information. Let P_Y, P_{Y′} be the observation distributions (function and gradient), and Q_Y, Q_{Y′} similar but with f_0(x) = (1/2)x². Then:
D(P_Y × P_{Y′} ‖ Q_Y × Q_{Y′}) = (f_1(x) − f_0(x))²/(2σ²) + (f′_1(x) − f′_0(x))²/(2σ²)
I Simplifications: (f_1(x) − f_0(x))² ≤ (ε + √(ε/2))² ≤ 2ε and (f′_1(x) − f′_0(x))² = 2ε
I With some manipulation, I(V; X, Y) ≤ (log 2)/2 when ε = σ² log 2 / (4n)
• Final result: M_n(F) ≥ σ² log 2 / (40n)
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 48/ 53
Upper vs. Lower Bounds
• Lower bound for 1D strongly convex functions: M_n(F) ≥ c σ²/n
• Upper bound for 1D strongly convex functions: M_n(F) ≤ c′ σ²/n
I Achieved by stochastic gradient descent
• Analogous results (and proof techniques) are known for d-dimensional functions, additional Lipschitz assumptions, etc. [Raginsky and Rakhlin, 2011]
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 49/ 53
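The matching σ²/n upper bound can be seen empirically. A minimal SGD sketch on a quadratic from the strongly convex (c = 1) class, using noisy gradients and step size 1/i; the specific parameters (x_star, seed, domain [0, 1]) are illustrative choices, not from the slides:

```python
import random

def sgd_gap(n, sigma, x_star=0.3, seed=0):
    """SGD with step size 1/i on f(x) = 0.5 * (x - x_star)**2, using noisy
    gradients f'(x) + N(0, sigma^2) and projection onto D = [0, 1];
    returns the optimality gap f(x_n) - f(x_star)."""
    rng = random.Random(seed)
    x = 0.0
    for i in range(1, n + 1):
        grad = (x - x_star) + rng.gauss(0.0, sigma)
        x = min(max(x - grad / i, 0.0), 1.0)  # gradient step, then project
    return 0.5 * (x - x_star) ** 2

# The gap decays on the order of sigma^2 / n, matching the slide's rate.
gap_small_n, gap_large_n = sgd_gap(500, 0.5), sgd_gap(20000, 0.5)
```

For this quadratic, the 1/i step size makes x_n an average of noisy observations of x_star, which is exactly why the expected gap scales like σ²/n.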
Limitations and Generalizations
• Limitations of Fano’s Inequality.
I Non-asymptotic weakness
I Typically hard to tightly bound mutual information in adaptive settings
I Restriction to KL divergence
• Generalizations of Fano’s Inequality.
I Non-uniform V [Han/Verdu, 1994]
I More general f -divergences [Guntuboyina, 2011]
I Continuous V [Duchi/Wainwright, 2013]
(This list is certainly incomplete!)
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 50/ 53
Example: Difficulties in Adaptive Settings
• A simple search problem: Find the (only) biased coin using few flips
(Figure: M coins, all with P[heads] = 1/2 except one with P[heads] = 1/2 + ε.)
I Heavy coin V ∈ {1, ..., M} uniformly at random
I Selected coin at time i = 1, ..., n is X_i, observation is Y_i ∈ {0, 1} (1 for heads)
• Non-adaptive setting:
I Since X_i and V are independent, can show I(V; Y_i | X_i) ≲ ε²/M
I Substituting into Fano's inequality gives the requirement n ≳ (M log M)/ε²
• Adaptive setting:
I Nuisance to characterize I(V; Y_i | X_i), as X_i depends on V due to adaptivity!
I Worst-case bounding only gives n ≳ (log M)/ε²
I Next lecture: An alternative tool that gives n ≳ M/ε²
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 51/ 53
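In the non-adaptive setting, the per-flip mutual information can even be computed exactly, since conditioned on X_i = x the outcome is Bernoulli(1/2 + ε/M) marginally. A quick check of the ε²/M scaling (numbers in the example are illustrative):

```python
import math

def h(p):
    """Binary entropy in nats."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log(p) - (1 - p) * math.log(1 - p)

def coin_mutual_info(M, eps):
    """Exact I(V; Y_i | X_i) when the flipped coin X_i is chosen
    independently of the biased coin V (e.g., uniformly): conditioned on
    X_i = x, P[Y_i = 1] = 1/2 + eps/M, so
    I = h(1/2 + eps/M) - [(1/M) h(1/2 + eps) + (1 - 1/M) h(1/2)]."""
    return h(0.5 + eps / M) - ((1.0 / M) * h(0.5 + eps) + (1.0 - 1.0 / M) * h(0.5))

# For small eps this behaves like eps^2 / M (up to constants), matching
# the scaling used on the slide.
mi = coin_mutual_info(M=100, eps=0.1)
```

Concavity of the binary entropy guarantees this quantity is non-negative, and it shrinks as M grows, which is what drives the (M log M)/ε² requirement.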
Conclusion
• Information theory as a theory of data:
(Diagram: information theory at the centre, connected to data generation, storage & transmission, inference & learning, and optimization.)
• Approach highlighted in this talk:
I Reduction to multiple hypothesis testing
I Application of Fano’s inequality
I Bounding the mutual information
• Examples:
I Group testing
I Graphical model selection
I Sparse regression
I Convex optimization
I ...hopefully many more to come!
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 52/ 53
Tutorial Chapter
• Tutorial Chapter: "An Introductory Guide to Fano's Inequality with Applications in Statistical Estimation" [S. and Cevher, 2019]
https://arxiv.org/abs/1901.00555
(To appear in the book Information-Theoretic Methods in Data Science, Cambridge University Press)
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 53/ 53