Information-Theoretic Limits for Inference, Learning, and Optimization
Part 1: Fano's Inequality for Statistical Estimation
Jonathan Scarlett
Croucher Summer Course in Information Theory (CSCIT), July 2019
Information Theory
• How do we quantify "information" in data?
• Information theory [Shannon, 1948]:
  – Fundamental limits of data communication:
    Source (U1, …, Up) → Encoder → Codeword (X1, …, Xn) → Channel → Output (Y1, …, Yn) → Decoder → Reconstruction (Û1, …, Ûp)
  – Information of source: Entropy
  – Information learned at channel output: Mutual information
• Principles:
  – First fundamental limits without complexity constraints, then practical methods
  – First asymptotic analyses, then convergence rates, finite-length, etc.
  – Mathematically tractable probabilistic models
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett, Slide 1/53
Information Theory and Data
• Conventional view: Information theory is a theory of communication
• Emerging view: Information theory is a theory of data
  [Diagrams: Information Theory in relation to Data Generation, Inference & Learning, Optimization, and Storage & Transmission]
• Extracting information from channel output vs. extracting information from data
Examples
• Information theory in machine learning and statistics:
  – Statistical estimation [Le Cam, 1973]
  – Group testing [Malyutov, 1978]
  – Multi-armed bandits [Lai and Robbins, 1985]
  – Phylogeny [Mossel, 2004]
  – Sparse recovery [Wainwright, 2009]
  – Graphical model selection [Santhanam and Wainwright, 2012]
  – Convex optimization [Agarwal et al., 2012]
  – DNA sequencing [Motahari et al., 2012]
  – Sparse PCA [Birnbaum et al., 2013]
  – Community detection [Abbe, 2014]
  – Matrix completion [Riegler et al., 2015]
  – Ranking [Shah and Wainwright, 2015]
  – Adaptive data analysis [Russo and Zou, 2015]
  – Supervised learning [Nokleby, 2016]
  – Crowdsourcing [Lahouti and Hassibi, 2016]
  – Distributed computation [Lee et al., 2018]
  – Bayesian optimization [Scarlett, 2018]
• Note: More than just using entropy / mutual information...
Analogies
Same concepts, different terminology:

    Communication Problems        | Data Problems
    ------------------------------|--------------------------------------
    Channels with feedback        | Active learning / adaptivity
    Rate-distortion theory        | Approximate recovery
    Joint source-channel coding   | Non-uniform prior
    Error probability             | Error probability
    Random coding                 | Random sampling
    Side information              | Side information
    Channels with memory          | Statistically dependent measurements
    Mismatched decoding           | Model mismatch
    ...                           | ...
Cautionary Notes
Some cautionary notes on the information-theoretic viewpoint:
  – The simple models we can analyze may be over-simplified (more so than in communication)
  – Compared to communication, we often can't get matching achievability/converse (often settle for correct scaling laws)
  – Information-theoretic limits are not (yet) considered much in practice (to my knowledge)... but they do guide algorithm design
  – We often encounter gaps between information-theoretic limits and computational limits
  – Often, information theory simply isn't the right tool for the job
Terminology: Achievability and Converse
Achievability result (example): Given n(ε) data samples, there exists an algorithm achieving an "error" of at most ε
  – Estimation error: ‖θ̂ − θtrue‖ ≤ ε
  – Optimization error: f(x_selected) ≤ min_x f(x) + ε
Converse result (example): In order to achieve an "error" of at most ε, any algorithm requires at least n(ε) data samples
Converse Bounds for Statistical Estimation via Fano's Inequality
(Based on the survey chapter https://arxiv.org/abs/1901.00555)
Statistical Estimation
  [Diagrams: (i) Index V → Select parameter θV → Algorithm (samples Y, inputs X) → Infer → Estimate V̂; (ii) Parameter θ → Algorithm (samples Y, inputs X) → Estimate θ̂]
• General statistical estimation setup:
  – Unknown parameter θ ∈ Θ
  – Samples Y = (Y1, …, Yn) drawn from Pθ(y)
  – More generally, from Pθ,X with inputs X = (X1, …, Xn)
  – Given Y (and possibly X), construct estimate θ̂
• Goal. Minimize some loss ℓ(θ, θ̂)
  – 0-1 loss: ℓ(θ, θ̂) = 1{θ̂ ≠ θ}
  – Squared ℓ2 loss: ℓ(θ, θ̂) = ‖θ̂ − θ‖₂²
• Typical example. Linear regression
  – Estimate θ ∈ R^p from Y = Xθ + Z
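As a concrete instance of the setup above, here is a minimal least-squares sketch of the linear-regression example; the dimensions, noise level, and use of NumPy's `lstsq` are illustrative assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5                      # number of samples, parameter dimension
theta_true = rng.standard_normal(p)

X = rng.standard_normal((n, p))    # non-adaptive inputs
Z = 0.1 * rng.standard_normal(n)   # observation noise
Y = X @ theta_true + Z             # linear model Y = X theta + Z

# Least-squares estimate of theta from (X, Y)
theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Squared l2 loss of the estimate
loss = np.sum((theta_hat - theta_true) ** 2)
print(loss)
```

With n much larger than p, the loss scales roughly like (noise variance) × p / n, which is the kind of achievability statement the converse bounds in this part are matched against.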
Defining Features
• There are many properties that impact the analysis:
  – Discrete θ (e.g., graph learning, sparsity pattern recovery)
  – Continuous θ (e.g., regression, density estimation)
  – Bayesian θ (average-case performance)
  – Minimax bounds over Θ (worst-case performance)
  – Non-adaptive inputs (all X1, …, Xn chosen in advance)
  – Adaptive inputs (Xi can be chosen based on Y1, …, Yi−1)
• This talk. Minimax bounds, mostly non-adaptive, first discrete and then continuous
High-Level Steps
Steps in attaining a minimax lower bound (converse):
1. Reduce the estimation problem to multiple hypothesis testing
2. Apply a form of Fano's inequality
3. Bound the resulting mutual information term
(Multiple hypothesis testing: Given samples Y1, …, Yn, determine which distribution among P1(y), …, PM(y) generated them. M = 2 gives binary hypothesis testing.)
Step I: Reduction to Multiple Hypothesis Testing
• Lower bound the worst-case error by the average over a hard subset {θ1, …, θM}:
  [Diagram: Index V → Select parameter θV → Samples Y, X → Infer → Estimate V̂]
• Idea:
  – Show that a "successful" algorithm returning θ̂ =⇒ correct estimation of V (When is this true?)
  – Equivalent statement: If V can't be estimated reliably, then θ̂ can't be successful.
Step I: Example
• Example: Suppose an algorithm is claimed to return θ̂ such that ‖θ̂ − θ‖₂ ≤ ε
• If θ1, …, θM are separated by 2ε, then from θ̂ we can identify the correct V ∈ {1, …, M}
• Note: Tension between the number of hypotheses, the difficulty in distinguishing them, and sufficient separation. Choosing a suitable set {θ1, …, θM} can be very challenging.
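A toy numerical version of this separation argument, with nearest-neighbour decoding of V from θ̂; the specific point set and value of ε are illustrative.

```python
import numpy as np

eps = 0.5
# Hard subset: points on a line with pairwise distance exactly 2*eps = 1
thetas = np.array([[0.0], [1.0], [2.0], [3.0]])

V_true = 2
theta_true = thetas[V_true]

# Any estimate within eps of the true parameter...
theta_hat = theta_true + 0.3          # ||theta_hat - theta_true|| <= eps

# ...identifies V via nearest-neighbour decoding, since all other
# hypotheses are at distance >= 2*eps - eps = eps away
V_hat = int(np.argmin(np.linalg.norm(thetas - theta_hat, axis=1)))
print(V_hat)
```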
Step II: Application of Fano's Inequality
• Standard form of Fano's inequality for M-ary hypothesis testing and uniform V:

      P[V̂ ≠ V] ≥ 1 − (I(V; V̂) + log 2) / log M

  – Intuition: Need the learned information I(V; V̂) to be close to the prior uncertainty log M
• Variations:
  – Non-uniform V
  – Approximate recovery
  – Conditional version
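The standard form above is straightforward to evaluate numerically; a small helper (working in nats) showing how the lower bound strengthens as M grows or as I(V; V̂) shrinks. The example (M, mi) pairs are arbitrary.

```python
import math

def fano_lower_bound(mi, M):
    """Fano: P[Vhat != V] >= 1 - (I(V; Vhat) + log 2) / log M  (in nats)."""
    return 1.0 - (mi + math.log(2)) / math.log(M)

# More hypotheses (larger M) or less information (smaller mi) => larger bound
for M, mi in [(16, 1.0), (1024, 1.0), (1024, 5.0)]:
    print(M, mi, max(0.0, fano_lower_bound(mi, M)))
```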
Step III: Bounding the Mutual Information
• The key quantity remaining after applying Fano's inequality is I(V; V̂)
• Data processing inequality: (Based on V → Y → V̂ or similar)
  – No inputs: I(V; V̂) ≤ I(V; Y)
  – Non-adaptive inputs: I(V; V̂|X) ≤ I(V; Y|X)
  – Adaptive inputs: I(V; V̂) ≤ I(V; X, Y)
• Tensorization: (Based on conditional independence of the samples)
  – No inputs: I(V; Y) ≤ Σ_{i=1}^n I(V; Yi)
  – Non-adaptive inputs: I(V; Y|X) ≤ Σ_{i=1}^n I(V; Yi|Xi)
  – Adaptive inputs: I(V; X, Y) ≤ Σ_{i=1}^n I(V; Yi|Xi)
• KL divergence bounds:
  – I(V; Y) ≤ max_{v,v′} D(P_{Y|V}(·|v) ‖ P_{Y|V}(·|v′))
  – I(V; Y) ≤ max_v D(P_{Y|V}(·|v) ‖ Q_Y) for any Q_Y
  – If each P_{Y|V}(·|v) is ε-close to the closest of Q1(y), …, QN(y) in KL divergence, then I(V; Y) ≤ log N + ε
  – (Similarly with conditioning on X)
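The tensorization step can be checked numerically in a toy model: V uniform on {0, 1}, observed through two conditionally independent symmetric channels (the crossover probability ρ is an illustrative choice).

```python
import numpy as np
from itertools import product

def mi_from_joint(p_vy):
    """Mutual information (nats) from a joint pmf array p[v, y]."""
    p_v = p_vy.sum(axis=1)
    p_y = p_vy.sum(axis=0)
    mi = 0.0
    for v in range(p_vy.shape[0]):
        for y in range(p_vy.shape[1]):
            if p_vy[v, y] > 0:
                mi += p_vy[v, y] * np.log(p_vy[v, y] / (p_v[v] * p_y[y]))
    return mi

rho = 0.1                          # crossover probability
# Single sample: V uniform on {0,1}, Yi = V flipped with probability rho
p1 = np.zeros((2, 2))
for v, y in product(range(2), range(2)):
    p1[v, y] = 0.5 * (1 - rho if y == v else rho)

# Two conditionally independent samples: Y = (Y1, Y2), flattened to 4 outcomes
p2 = np.zeros((2, 4))
for v, y1, y2 in product(range(2), range(2), range(2)):
    p2[v, 2 * y1 + y2] = 0.5 * ((1 - rho if y1 == v else rho) *
                                (1 - rho if y2 == v else rho))

I_n = mi_from_joint(p2)            # I(V; Y1, Y2)
I_sum = 2 * mi_from_joint(p1)      # sum over i of I(V; Yi)
print(I_n, I_sum)                  # tensorization: I_n <= I_sum
```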
Discrete Example 1: Group Testing

Group Testing
  [Diagram: Items (contaminated/clean) × Tests → Outcomes]
• Goal: Given the test matrix X and outcomes Y, recover the item vector β
• Sample complexity: Required number of tests n
Information Theory and Group Testing
• Information-theoretic viewpoint:
  – S: Defective set;  X_S: Columns of X indexed by S
  [Diagram: Message S → Encoder → Codeword X_S → Channel, Y ∼ P^n_{Y|X_S} → Decoder → Estimate Ŝ]
Information Theory and Group Testing
• Example formulation of a general result:

      Sample complexity:  n ≳ H(S) / I(P_{Y|X_S})

  – H(S): entropy (model uncertainty)
  – I(P_{Y|X_S}): mutual information (information learned from measurements)
Converse via Fano's Inequality
• Reduction to multiple hypothesis testing: Trivial! Set V = S.
• Application of Fano's inequality:

      P[Ŝ ≠ S] ≥ 1 − (I(S; Ŝ|X) + log 2) / log (p choose k)

• Bounding the mutual information:
  – Data processing inequality: I(S; Ŝ|X) ≤ I(U; Y), where U are the pre-noise outputs
  – Tensorization: I(U; Y) ≤ Σ_{i=1}^n I(Ui; Yi)
  – Capacity bound: I(Ui; Yi) ≤ C if the outcome is passed through a channel of capacity C
• Final result:

      n ≤ (1 − ε) · log (p choose k) / C   =⇒   P[Ŝ ≠ S] ↛ 0
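For the noiseless model, each binary test outcome carries at most one bit, i.e. C = log 2, and the resulting counting bound is easy to evaluate; a small sketch (example values of p and k are arbitrary).

```python
import math

def counting_bound(p, k, C=math.log(2)):
    """Fano/counting converse: at least log(p choose k) / C tests are needed."""
    return math.log(math.comb(p, k)) / C

# With C = log 2 this is log2(p choose k), roughly k*log2(p/k) for k << p
print(counting_bound(1000, 10))
```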
Noiseless Group Testing (Exact Recovery)
  [Figure: achievable rates vs. sparsity parameter; curves: Information-Theoretic, Counting Bound, DD, COMP, Bernoulli, Near-Const.]
• Other implications:
  – Adaptivity and approximate recovery:
    – No gain at low sparsity levels
    – Significant gain at high sparsity levels
  – Efficient algorithm optimal (w.r.t. random test design) at high sparsity levels
  – Analogous results in noisy settings
Detour: An Information-Theoretic Achievability Example

A Practical Approach: Separate Decoding of Items
• Searching over (p choose k) subsets is infeasible
• Practical alternative: Separate decoding of items
  [Diagram: Items × Tests → Outcomes, decoded item-by-item]
• Channel coding interpretation:
  [Diagram: Encoder j → Channel j → Decoder j for each item j = 1, …, p, with inputs Xj and outputs Y]
• Claim: Asymptotic optimality to within a factor of log 2 ≈ 0.7 (sometimes)
Example Results (Symmetric Noise)
• Summary of results in the symmetric noise setting: [Scarlett/Cevher, 2018]
  – Joint: Intractable joint decoding rule (otherwise separate)
  – False pos/neg: Small number of false positives/negatives allowed in reconstruction
  [Figure: asymptotic ratio of k log₂(p/k) to n vs. θ, where k = Θ(p^θ); curves: Joint false pos/neg, Joint exact, False pos/neg, False neg, False pos, Exact, Best previous practical]
• Note: Within a factor log 2 ≈ 0.7 of the information-theoretic optimum at low sparsity
SDI Achievability Proof 1 (Notation)
• Define the defectivity indicator random variable

      βj = 1{j ∈ S}

• Separate decoding of items: Decode βj based only on Xj and Y
  – Xj : j-th column of X
• Binary hypothesis testing based rule:

      β̂j(Xj, Y) = 1{ Σ_{i=1}^n log [ P_{Y|Xj,βj}(Y^(i) | Xj^(i), 1) / P_Y(Y^(i)) ] > γ }

  – X^(i) : i-th row of X
  – Y^(i) : i-th test outcome
  – γ : Threshold to be set later
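The threshold rule above can be exercised in the noiseless special case: any negative test containing item j drives the information density to −∞, so for any finite γ the rule reduces to declaring j defective iff no negative test contains it (the COMP rule). A minimal simulation under an assumed Bernoulli test design; the problem sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, n = 500, 5, 120                  # items, defectives, tests (illustrative)
nu = np.log(2) / k                     # Bernoulli design: include each item w.p. (log 2)/k

S = rng.choice(p, size=k, replace=False)
beta = np.zeros(p, dtype=bool)
beta[S] = True

X = rng.random((n, p)) < nu            # random test matrix
Y = X[:, beta].any(axis=1)             # noiseless OR channel

# Noiseless separate decoding: item j is declared defective iff it
# appears in no negative test (negative tests give -infinite LLR terms)
beta_hat = ~((X & ~Y[:, None]).any(axis=0))

errors = np.count_nonzero(beta_hat != beta)
print(errors)
```

By construction this rule makes no false negatives; the only errors are non-defective items that happen never to appear in a negative test, which becomes rare as n grows.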
SDI Achievability Proof 2 (Non-Asymptotic Bound)
Non-asymptotic bound: Under Bernoulli testing, we have

      Pe ≤ k · P[ı₁ⁿ(X₁, Y) ≤ γ] + (p − k) e^{−γ},

where

      ı₁ⁿ(Xj, Y) := Σ_{i=1}^n ı₁(Xj^(i), Y^(i)),    ı₁(x₁, y) := log [ P_{Y|X₁}(y|x₁) / P_Y(y) ]

is the information density.

Proof idea:
  – First term: Union bound over k false negative events
  – Second term: Union bound over p − k false positive events & change of measure
SDI Achievability Proof 3 (Asymptotic Analysis)
Simplification: If we have concentration of the form

      P[ı₁ⁿ(X₁, Y) ≤ n I(X₁; Y)(1 − δ₂)] ≤ ψ_n(δ₂),

then the following conditions suffice for Pe → 0:

      n ≥ log((p − k)/δ₁) / [I(X₁; Y)(1 − δ₂)],        k · ψ_n(δ₂) → 0

Notes:
  – Proved by setting γ = log((p − k)/δ₁) in the initial result
  – With false positives allowed, can instead use γ = log((p − k)/(k δ₁))
  – With false negatives allowed, it suffices that ψ_n(δ₂) → 0
SDI Achievability Proof 4 (Concentration Bounds)
Bernstein's inequality:

      P[ı₁ⁿ(X₁, Y) ≤ n I₁(1 − δ₂)] ≤ exp( − (1/2) · (n/k) · c²_mean δ₂² / (c_var + (1/3) c_mean c_max δ₂) )

for suitable constants c_mean, c_var, c_max.

Refined concentration for the noiseless model:

      P[ı₁ⁿ(X₁, Y) ≤ n I₁(1 − δ₂)] ≤ exp( − (n log 2 / k) · ((1 − δ₂) log(1 − δ₂) + δ₂) · (1 + o(1)) )
SDI Achievability Proof 5 (Final Conditions)
Final noiseless result: We have Pe → 0 if

      n ≥ min_{δ₂>0} max{ k log p / [(log 2)²(1 − δ₂)],  k log k / [(log 2)((1 − δ₂) log(1 − δ₂) + δ₂)] } · (1 + η)

Final noisy result: We have Pe → 0 if

      n ≥ min_{δ₂>0} max{ k log p / [(log 2)(log 2 − H₂(ρ))(1 − δ₂)],  (k log k)(c_var + (1/3) c_mean c_max δ₂) / [(1/2) c²_mean δ₂²] } · (1 + η)

Approximate recovery: We have Pe^(approx) → 0 if

      n ≥ [log(p/k) / I(X₁; Y)] · (1 + η)
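The noiseless condition can be evaluated numerically by gridding δ₂; a sketch (the grid resolution and the example values of p and k are arbitrary choices, and η is set to 0 here).

```python
import math

def noiseless_sdi_bound(p, k, eta=0.0):
    """Minimize over delta2 the max of the two noiseless sample-size terms."""
    log2 = math.log(2)
    best = float("inf")
    for i in range(1, 1000):
        d2 = i / 1000.0
        t1 = k * math.log(p) / (log2 ** 2 * (1 - d2))
        t2 = k * math.log(k) / (log2 * ((1 - d2) * math.log(1 - d2) + d2))
        best = min(best, max(t1, t2))
    return best * (1 + eta)

print(noiseless_sdi_bound(10**4, 10))
```

The first term dominates for small δ₂ and the second for δ₂ near 1, so the minimizing δ₂ balances the two.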
Example Results (Symmetric Noise)
• Summary of results in the symmetric noise setting: [Scarlett/Cevher, 2018]
  – Joint: Intractable joint decoding rule (otherwise separate)
  – False pos/neg: Small number of false positives/negatives allowed in reconstruction
  [Figure: asymptotic ratio of k log₂(p/k) to n vs. θ, where k = Θ(p^θ); curves: Joint false pos/neg, Joint exact, False pos/neg, False neg, False pos, Exact, Best previous practical]
• Note: Within a factor log 2 ≈ 0.7 of the information-theoretic optimum at low sparsity
Discrete Example 2: Graphical Model Selection

Graphical Model Representations of Joint Distributions
Motivating example:
  – In a population of p people, let

        Yi = { 1 if person i is infected;  −1 if person i is healthy },   i = 1, …, p

  – Example models: [Abbe and Wainwright, ISIT Tutorial, 2015]
  – Joint distribution for a given graph G = (V, E):

        P[(Y₁, …, Yp) = (y₁, …, yp)] = (1/Z) exp( Σ_{(i,j)∈E} λ_ij y_i y_j )
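For intuition, the joint distribution above can be computed exactly for a tiny graph by enumerating all 2^p configurations; the cycle graph, uniform edge strength, and size here are illustrative choices.

```python
import numpy as np
from itertools import product

def ising_pmf(edges, lam, p):
    """Exact pmf of a small Ising model: P(y) = (1/Z) exp(sum_{(i,j) in E} lam*yi*yj)."""
    configs = list(product([-1, 1], repeat=p))
    weights = np.array([np.exp(sum(lam * y[i] * y[j] for i, j in edges))
                        for y in configs])
    Z = weights.sum()                  # partition function
    return configs, weights / Z

# 4-node cycle graph with uniform edge strength lambda = 0.8
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
configs, probs = ising_pmf(edges, lam=0.8, p=4)

# Positive lam favours agreement: the two fully aligned configurations
# (all -1 and all +1) are the modes of the distribution
print(configs[int(np.argmax(probs))])
```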
Graphical Model Selection: Illustration
• A larger example from [Abbe and Wainwright, ISIT Tutorial, 2015]:
  – Example graphs: [images]
  – Sample images (Ising model): [images]
• Goal: Identify the graph given n independent samples
Graphical Model Selection: Definition
General problem statement.
  – Given n i.i.d. samples of (Y₁, …, Yp) ∼ P_G, recover the underlying graph G
  – Applications: Statistical physics, social and biological networks
  – Error probability:

        Pe = max_{G∈𝒢} P[Ĝ ≠ G | G]

Assumptions.
  – Distribution class:
    – Ising model:  P_G(y₁, …, yp) = (1/Z) exp( Σ_{(i,j)∈E} λ_ij y_i y_j )
    – Gaussian model:  (Y₁, …, Yp) ∼ N(μ, Σ), where (Σ⁻¹)_ij ≠ 0 ⟺ (i, j) ∈ E  [Hammersley-Clifford theorem]
  – Graph class:
    – Bounded-edge (at most k edges in total)
    – Bounded-degree (at most d edges out of each node)
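A small numerical illustration of the Gaussian case: edges correspond exactly to nonzeros of the precision matrix Σ⁻¹, even though the covariance itself is dense. The chain graph and the edge weight 0.4 are illustrative choices (small enough to keep the precision matrix positive definite).

```python
import numpy as np

# Gaussian graphical model on a 4-node chain 0-1-2-3: build a precision
# matrix Theta whose nonzero off-diagonals are exactly the chain's edges
p = 4
Theta = np.eye(p)
for i in range(p - 1):
    Theta[i, i + 1] = Theta[i + 1, i] = 0.4

Sigma = np.linalg.inv(Theta)              # covariance of the model

# The covariance is dense (even non-adjacent nodes are correlated), but
# inverting it recovers the graph: (Sigma^{-1})_ij != 0  <=>  (i,j) in E
recovered = np.abs(np.linalg.inv(Sigma)) > 1e-8
edges = [(i, j) for i in range(p) for j in range(i + 1, p) if recovered[i, j]]
print(edges)
```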
Information-Theoretic Viewpoint
• Information-theoretic viewpoint:
  [Diagram: G → Channel P_{Y|G} → Y → Decoder → Ĝ]
Converse via Fano's Inequality
• Reduction to multiple hypothesis testing: Let G be uniform on a hard subset 𝒢₀ ⊆ 𝒢
  – Ideally many graphs (lots of graphs to distinguish)
  – Ideally close together (harder to distinguish)
• Application of Fano's inequality:

      P[Ĝ ≠ G] ≥ 1 − (I(G; Ĝ) + log 2) / log |𝒢₀|

• Bounding the mutual information:
  – Data processing inequality: I(G; Ĝ) ≤ I(G; Y)
  – Tensorization: I(G; Y) ≤ Σ_{i=1}^n I(G; Yi)
  – KL divergence bound: Bound I(G; Yi) ≤ max_G D(P_{Y|G}(·|G) ‖ Q_Y) case-by-case
Graph Ensembles
I Graphs that are difficult to distinguish from the empty graph:
Reveals n = Ω((1/λ²) log p) necessary condition with "edge strength" λ and p nodes
I Graphs that are difficult to distinguish from the complete (sub-)graph:
Reveals n = Ω(e^{λd}) necessary condition with "edge strength" λ and degree d
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 31/ 53
Upper vs. Lower Bounds
• Example results with maximal degree d, edge strength λ (slightly informal):
I (Converse) n = Ω(max{1/λ², e^{λd}} log p) [Santhanam and Wainwright, 2012]
I (Achievability) n = O(max{1/λ², e^{λd} d} log p) [Santhanam and Wainwright, 2012]
I (Early Practical) n = O(d² log p), but with extra assumptions that are hard to certify [Ravikumar/Wainwright/Lafferty, 2010]
I (Recent Practical) n = O((d² e^{λd}/λ²) log p) [Klivans/Meka, 2017]
I ...and more [Lokhov et al., 2018] [Wu/Sanghavi/Dimakis, 2018]
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 32/ 53
Graphical Model Selection with Twists
• Approximate recovery: [S., Cevher, IEEE TSIPN, 2016]
P_e(d_max) = max_{G∈G} P[d_H(Ê, E) > d_max | G]
• Active learning / adaptive sampling: [S., Cevher, AISTATS, 2017]
(Diagram: at each round i, the decoder chooses nodes X^(i) to observe and receives samples Y^(i) from the graph G.)
• Analogies:
Communication Problems | Data Problems
Feedback | Active learning / adaptivity
Rate-distortion theory | Approximate recovery
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 33/ 53
What About Continuous-Valued Estimation?
Statistical Estimation
(Diagram: select index V → parameter θ_V → samples Y (with inputs X) → algorithm → estimate θ̂ → inferred index V̂.)
• General statistical estimation setup:
I Unknown parameter θ ∈ Θ
I Samples Y = (Y_1, ..., Y_n) drawn from P_θ(y)
I More generally, from P_{θ,X} with inputs X = (X_1, ..., X_n)
I Given Y (and possibly X), construct estimate θ̂
• Goal. Minimize some loss ℓ(θ, θ̂)
I 0-1 loss: ℓ(θ, θ̂) = 1{θ̂ ≠ θ}
I Squared ℓ2 loss: ‖θ − θ̂‖²₂
• Typical example. Linear regression
I Estimate θ ∈ ℝ^p from Y = Xθ + Z
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 34/ 53
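The linear regression example above can be made concrete. A 1-D toy instance (illustrative parameters, not from the slides): θ is a scalar, the noise is i.i.d. N(0, σ²), and the least-squares estimate is ⟨X, Y⟩ / ⟨X, X⟩:

```python
import random

# 1-D instance of Y = X*theta + Z with i.i.d. Gaussian noise; the
# parameters (n, theta, sigma) are purely illustrative.
random.seed(0)
n, theta, sigma = 2000, 1.5, 0.5
X = [random.gauss(0.0, 1.0) for _ in range(n)]
Y = [theta * x + random.gauss(0.0, sigma) for x in X]

# Least-squares estimate theta_hat = <X, Y> / <X, X>.
theta_hat = sum(x * y for x, y in zip(X, Y)) / sum(x * x for x in X)
sq_loss = (theta_hat - theta) ** 2  # the squared l2 loss from the slide
```

The squared loss here concentrates around σ²/Σx_i², illustrating why the loss is treated as a random quantity on the next slide.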
Minimax Risk
• Since the samples are random, so is θ̂, and hence so is ℓ(θ, θ̂)
• So we seek to minimize the average loss E_θ[ℓ(θ, θ̂)].
I Note: E_θ and P_θ denote averages w.r.t. Y when the true parameter is θ.
• Minimax risk:
M_n(Θ, ℓ) = inf_{θ̂} sup_{θ∈Θ} E_θ[ℓ(θ, θ̂)],
i.e., the worst-case average loss over all θ ∈ Θ
• Approach: Lower bound the worst-case error by an average over a hard subset {θ_1, ..., θ_M}:
(Diagram: select index V → parameter θ_V → samples Y (with inputs X) → algorithm → estimate θ̂ → inferred index V̂.)
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 35/ 53
General Lower Bound via Fano’s Inequality
• To get a meaningful result, we need a sufficiently "well-behaved" loss function. Subsequently, focus on loss functions of the form
ℓ(θ, θ̂) = Φ(ρ(θ, θ̂)),
where ρ(θ, θ′) is some metric, and Φ(·) is some non-negative and increasing function (e.g., ℓ(θ, θ̂) = ‖θ − θ̂‖²₂)
• Claim. Fix ε > 0, and let {θ_1, ..., θ_M} be a finite subset of Θ such that
ρ(θ_v, θ_{v′}) ≥ ε, ∀v, v′ ∈ {1, ..., M}, v ≠ v′.
Then, we have
M_n(Θ, ℓ) ≥ Φ(ε/2) (1 − (I(V; Y) + log 2) / log M),
where V is uniform on {1, ..., M}, and I(V; Y) is with respect to V → θ_V → Y.
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 36/ 53
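The claim is a plug-in formula once the separation ε, the packing size M, and a bound on I(V; Y) are in hand. A small evaluator (hypothetical numbers in the example; Φ defaults to the squared-distance loss mentioned above):

```python
import math

def minimax_lower_bound(eps, mutual_info, M, Phi=lambda t: t ** 2):
    """Evaluate the claimed bound
        M_n(Theta, l) >= Phi(eps/2) * (1 - (I(V; Y) + log 2) / log M)
    for an eps-separated subset of size M (all quantities in nats)."""
    assert M >= 2 and eps > 0
    frac = 1.0 - (mutual_info + math.log(2)) / math.log(M)
    return Phi(eps / 2.0) * max(frac, 0.0)

# Hypothetical numbers: separation eps = 0.2, M = 1024 hypotheses,
# and 2 nats of mutual information.
risk_lb = minimax_lower_bound(eps=0.2, mutual_info=2.0, M=1024)
```

As with Fano's inequality itself, the bound becomes vacuous (clipped to 0) once I(V; Y) exceeds log M.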
Proof of General Lower Bound
• Using Markov's inequality (for any ε0 > 0):
sup_{θ∈Θ} E_θ[ℓ(θ, θ̂)] ≥ sup_{θ∈Θ} Φ(ε0) P_θ[ℓ(θ, θ̂) ≥ Φ(ε0)] = Φ(ε0) sup_{θ∈Θ} P_θ[ρ(θ, θ̂) ≥ ε0]
• Suppose that V̂ = arg min_{j=1,...,M} ρ(θ_j, θ̂). Then, by the triangle inequality and ρ(θ_v, θ_{v′}) ≥ ε, if ρ(θ_v, θ̂) < ε/2 then we must have V̂ = v:
P_{θ_v}[ρ(θ_v, θ̂) ≥ ε/2] ≥ P_{θ_v}[V̂ ≠ v].
• Hence, setting ε0 = ε/2,
sup_{θ∈Θ} P_θ[ρ(θ, θ̂) ≥ ε/2] ≥ max_{v=1,...,M} P_{θ_v}[ρ(θ_v, θ̂) ≥ ε/2]
≥ max_{v=1,...,M} P_{θ_v}[V̂ ≠ v]
≥ (1/M) Σ_{v=1,...,M} P_{θ_v}[V̂ ≠ v]
≥ 1 − (I(V; Y) + log 2) / log M,
where the final step uses Fano's inequality.
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 37/ 53
Local Approach
• General bound: If ρ(θ_v, θ_{v′}) ≥ ε for all v ≠ v′, then
M_n(Θ, ℓ) ≥ Φ(ε/2) (1 − (I(V; Y) + log 2) / log M)
• Local approach: Carefully-chosen "local" hard subset:
(Figure: the distributions P_v of the ε-separated parameters all lie within a small KL divergence ball around a common centre Q_Y.)
• Resulting bound:
M_n(Θ, ℓ) ≥ Φ(ε/2) (1 − (max_{v=1,...,M} D(P^n_{θ_v} ‖ Q^n_Y) + log 2) / log M).
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 38/ 53
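The local approach is easy to instantiate when the observation distributions are Gaussian with a common variance, since the n-sample KL divergence has a closed form. A sketch (illustrative parameters; the Gaussian model is an assumption made here for concreteness):

```python
import math

def gaussian_kl_n(mu_p, mu_q, sigma, n):
    """D(P^n || Q^n) for n i.i.d. samples of N(mu_p, sigma^2) versus
    N(mu_q, sigma^2): n * (mu_p - mu_q)^2 / (2 * sigma^2)."""
    return n * (mu_p - mu_q) ** 2 / (2.0 * sigma ** 2)

def local_fano_bound(eps, mus, mu_q, sigma, n, Phi=lambda t: t ** 2):
    """Plug the worst-case divergence from the common centre Q_Y into the
    local-approach bound (this divergence upper-bounds I(V; Y))."""
    d_worst = max(gaussian_kl_n(mu, mu_q, sigma, n) for mu in mus)
    frac = 1.0 - (d_worst + math.log(2)) / math.log(len(mus))
    return Phi(eps / 2.0) * max(frac, 0.0)

# Ten means spaced eps = 0.01 apart, all close to the centre mu_q.
mus = [i * 0.01 for i in range(10)]
lb = local_fano_bound(eps=0.01, mus=mus, mu_q=0.045, sigma=1.0, n=10)
```

The point of the "local" construction is visible here: keeping the means close to mu_q keeps d_worst small, so the bracketed factor stays bounded away from zero.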
Global Approach
• General bound: If ρ(θ_v, θ_{v′}) ≥ ε for all v ≠ v′, then
M_n(Θ, ℓ) ≥ Φ(ε/2) (1 − (I(V; Y) + log 2) / log M)
• Global approach: Pack as many ε-separated points as possible:
(Figure: ε-separated parameters packed throughout Θ, with the induced distributions covered by KL divergence balls.)
I Typically suited to infinite-dimensional problems (e.g., non-parametric regression)
• Resulting bound:
M_n(Θ, ℓ) ≥ Φ(ε_p/2) (1 − (log N*_{KL,n}(Θ, ε_{c,n}) + ε_{c,n} + log 2) / log M*_ρ(Θ, ε_p)).
I M*_ρ(Θ, ε_p): number of ε_p-separated θ we can pack into Θ (packing number)
I N*_{KL,n}(Θ, ε_{c,n}): number of ε_{c,n}-size KL divergence balls needed to cover {P_Y} (covering number)
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 39/ 53
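Packing numbers like M*_ρ(Θ, ε_p) can be lower-bounded constructively by a greedy procedure: scan candidate points and keep each one that is ε-far from everything kept so far. A minimal sketch on a toy discrete parameter set (not from the slides):

```python
def greedy_packing(points, eps, dist=lambda a, b: abs(a - b)):
    """Greedily keep points at pairwise distance >= eps; the size of the
    result lower-bounds the packing number M*_rho(Theta, eps)."""
    packing = []
    for p in points:
        if all(dist(p, q) >= eps for q in packing):
            packing.append(p)
    return packing

# Toy example: Theta = {0, 1, ..., 100} with the absolute-difference
# metric; a 10-separated greedy packing picks out 0, 10, ..., 100.
pack = greedy_packing(list(range(101)), 10)
```

The same greedy loop works for any metric `dist`, which is how ε-separated subsets are typically exhibited in converse arguments.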
Continuous Example 1: Sparse Linear Regression
Sparse Linear Regression
• Linear regression model Y = Xθ + Z:
(Diagram: samples Y ∈ ℝ^n = feature matrix X ∈ ℝ^{n×p} times coefficients θ ∈ ℝ^p plus noise Z ∈ ℝ^n.)
I Feature matrix X is given, noise is i.i.d. N(0, σ²)
I Coefficients are sparse – at most k non-zeros
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 40/ 53
Converse via Fano’s Inequality
• Reduction to hypothesis testing: Fix ε > 0 and restrict to sparse vectors of the form
θ = (0, 0, 0, ±ε, 0, ±ε, 0, 0, 0, 0, 0, ±ε, 0)
I Total number of such sequences = 2^k (p choose k) ≈ exp(k log(p/k)) (if k ≪ p)
I Choose a "well separated" subset of size exp((k/4) log(p/k)) (Gilbert–Varshamov)
I Well-separated: Non-zero entries differ in at least k/8 indices
• Application of Fano's inequality:
I Using the general bound given previously:
M_n(Θ, ℓ) ≥ (kε²/32) (1 − (I(V; Y|X) + log 2) / ((k/4) log(p/k)))
• Bounding the mutual information:
I By a direct calculation, I(V; Y|X) ≤ (ε²/(2σ²)) · (k/p) ‖X‖²_F (Gaussian RVs)
I Substitute and choose ε to optimize the bound: ε² = σ² p log(p/k) / (2‖X‖²_F)
• Final result: If ‖X‖²_F ≤ npΓ, then E[‖θ̂ − θ‖²₂] ≤ δ requires n ≥ (cσ²/(Γδ)) · k log(p/k)
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 41/ 53
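The Gilbert–Varshamov step above is existential, but the flavour of the construction can be reproduced by a greedy search over random k-sparse sign vectors, keeping candidates whose coordinates differ sufficiently from everything kept so far. A small stand-in sketch (all parameters here are illustrative, not the slide's k/8 regime):

```python
import random

def hamming_diff(u, v):
    """Number of coordinates in which two sparse vectors (index -> value) differ."""
    return sum(u.get(i) != v.get(i) for i in set(u) | set(v))

def sparse_sign_packing(p, k, eps, min_diff, tries=300, seed=1):
    """Greedily collect k-sparse vectors with entries +-eps whose
    coordinates differ in at least min_diff positions -- a small stand-in
    for the Gilbert-Varshamov subset used in the converse argument."""
    rng = random.Random(seed)
    packing = []
    for _ in range(tries):
        support = rng.sample(range(p), k)
        cand = {i: rng.choice((-eps, eps)) for i in support}
        if all(hamming_diff(cand, v) >= min_diff for v in packing):
            packing.append(cand)
    return packing

pack = sparse_sign_packing(p=50, k=8, eps=0.1, min_diff=2)
```

For these parameters almost every random candidate is accepted, reflecting the slide's point that exponentially many well-separated sparse vectors exist.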
Upper vs. Lower Bounds
• Recap of model: Y = Xθ + Z, where θ is k-sparse
• Lower bound: If ‖X‖²_F ≤ npΓ, achieving E[‖θ̂ − θ‖²₂] ≤ δ requires n ≥ (cσ²/(Γδ)) · k log(p/k)
• Upper bound: If X is a zero-mean random Gaussian matrix with power Γ per entry, then we can achieve E[‖θ̂ − θ‖²₂] ≤ δ using n = (c′σ²/(Γδ)) · k log(p/k) samples
I Maximum-likelihood estimation suffices
• Tighter lower bounds could potentially be obtained under additional restrictions on X
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 42/ 53
Continuous Example 2: Convex Optimization
Stochastic Convex Optimization
• A basic optimization problem:
x* = arg min_{x∈D} f(x)
For simplicity, we focus on the 1D case D ⊆ ℝ (extensions to ℝ^d are possible)
• Model:
I Noisy samples: When we query x, we get a noisy value and a noisy gradient:
Y = f(x) + Z,  Y′ = f′(x) + Z′,
where Z ~ N(0, σ²) and Z′ ~ N(0, σ²)
I Adaptive sampling: Chosen X_i may depend on Y_1, ..., Y_{i−1}
• Function classes: Convex, strongly convex, Lipschitz, self-concordant, etc.
I We will focus on the class of strongly convex functions
I Strong convexity: f(x) − (c/2)x² is a convex function for some c > 0 (we set c = 1)
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 43/ 53
Performance Measure and Minimax Risk
• After sampling n points, the algorithm returns a final point X̂
• The loss incurred is ℓ_f(X̂) = f(X̂) − min_{x∈X} f(x), i.e., the gap to the optimum
• For a given class of functions F, the minimax risk is given by
M_n(F) = inf_{X̂} sup_{f∈F} E_f[ℓ_f(X̂)]
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 44/ 53
Reduction to Multiple Hypothesis Testing
• The picture remains the same:
(Diagram: select index V → function f_V → queried points x and samples y → infer estimate V̂.)
I Successful optimization ⟹ successful identification of V
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 45/ 53
General Minimax Lower Bound
• Claim 1. Fix ε > 0, and let {f_1, ..., f_M} ⊆ F be a subset of F such that for each x ∈ X, we have ℓ_{f_v}(x) ≤ ε for at most one value of v ∈ {1, ..., M}. Then we have
M_n(F) ≥ ε · (1 − (I(V; X, Y) + log 2) / log M),   (1)
where V is uniform on {1, ..., M}, and I(V; X, Y) is w.r.t. V → f_V → (X, Y).
• Claim 2. In the special case M = 2, we have
M_n(F) ≥ ε · H_2^{−1}(log 2 − I(V; X, Y)),   (2)
where H_2^{−1}(·) ∈ [0, 0.5] is the inverse binary entropy function.
• Proof is as with estimation, starting with Markov's inequality:
sup_{f∈F} E_f[ℓ_f(X̂)] ≥ sup_{f∈F} ε · P_f[ℓ_f(X̂) ≥ ε].
• Proof for M = 2 uses a (somewhat less well-known) form of Fano's inequality for binary hypothesis testing
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 46/ 53
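The inverse binary entropy H_2^{−1} in Claim 2 has no closed form, but it is monotone on [0, 1/2] and easy to compute by bisection. A self-contained sketch (working in nats throughout):

```python
import math

def h2(p):
    """Binary entropy in nats."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log(p) - (1 - p) * math.log(1 - p)

def h2_inv(y):
    """Inverse of h2 restricted to [0, 1/2], computed by bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if h2(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def two_point_bound(eps, mutual_info):
    """Claim 2: M_n(F) >= eps * H2^{-1}(log 2 - I(V; X, Y))."""
    return eps * h2_inv(max(math.log(2) - mutual_info, 0.0))

# H2^{-1}((log 2)/2) is roughly 0.11.
val = h2_inv(math.log(2) / 2)
```

Note that the bound degrades gracefully: once I(V; X, Y) reaches log 2, the argument of H_2^{−1} hits 0 and the bound vanishes.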
Strongly Convex Class: Choice of Hard Subset
• Reduction to hypothesis testing. In 1D, it suffices to choose just two similar functions!
I (Becomes 2^{constant·d} functions in d dimensions)
(Figure: two nearby quadratics plotted over the input range [0, 1].)
• The precise functions:
f_v(x) = (1/2)(x − x*_v)², v = 1, 2,
with x*_1 = 1/2 − √(2ε′) and x*_2 = 1/2 + √(2ε′)
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 47/ 53
Analysis
• Application of Fano's inequality. As above,
M_n(F) ≥ ε · H_2^{−1}(log 2 − I(V; X, Y))
I Approach: H_2^{−1}(α) ≥ 1/10 if α ≥ (log 2)/2
I How few samples ensure I(V; X, Y) ≤ (log 2)/2?
• Bounding the mutual information. Let P_Y, P_{Y′} be the observation distributions (function and gradient), and Q_Y, Q_{Y′} similar but with f_0(x) = (1/2)x². Then:
D(P_Y × P_{Y′} ‖ Q_Y × Q_{Y′}) = (f_1(x) − f_0(x))²/(2σ²) + (f′_1(x) − f′_0(x))²/(2σ²)
I Simplifications: (f_1(x) − f_0(x))² ≤ (ε + √(ε/2))² ≤ 2ε and (f′_1(x) − f′_0(x))² = 2ε
I With some manipulation, I(V; X, Y) ≤ (log 2)/2 when ε = σ² log 2 / (4n)
• Final result: M_n(F) ≥ σ² log 2 / (40n)
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 48/ 53
Upper vs. Lower Bounds
• Lower bound for 1D strongly convex functions: M_n(F) ≥ c σ²/n
• Upper bound for 1D strongly convex functions: M_n(F) ≤ c′ σ²/n
I Achieved by stochastic gradient descent
• Analogous results (and proof techniques) are known for d-dimensional functions, additional Lipschitz assumptions, etc. [Raginsky and Rakhlin, 2011]
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 49/ 53
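The matching σ²/n upper bound can be seen empirically. A minimal SGD sketch on a quadratic from the strongly convex (c = 1) class, using noisy gradients and step size 1/i; the specific parameters (x_star, seed, domain [0, 1]) are illustrative choices, not from the slides:

```python
import random

def sgd_gap(n, sigma, x_star=0.3, seed=0):
    """SGD with step size 1/i on f(x) = 0.5 * (x - x_star)**2, using noisy
    gradients f'(x) + N(0, sigma^2) and projection onto D = [0, 1];
    returns the optimality gap f(x_n) - f(x_star)."""
    rng = random.Random(seed)
    x = 0.0
    for i in range(1, n + 1):
        grad = (x - x_star) + rng.gauss(0.0, sigma)
        x = min(max(x - grad / i, 0.0), 1.0)  # gradient step, then project
    return 0.5 * (x - x_star) ** 2

# The gap decays on the order of sigma^2 / n, matching the slide's rate.
gap_small_n, gap_large_n = sgd_gap(500, 0.5), sgd_gap(20000, 0.5)
```

For this quadratic, the 1/i step size makes x_n an average of noisy observations of x_star, which is exactly why the expected gap scales like σ²/n.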
Limitations and Generalizations
• Limitations of Fano’s Inequality.
I Non-asymptotic weakness
I Typically hard to tightly bound mutual information in adaptive settings
I Restriction to KL divergence
• Generalizations of Fano’s Inequality.
I Non-uniform V [Han/Verdu, 1994]
I More general f -divergences [Guntuboyina, 2011]
I Continuous V [Duchi/Wainwright, 2013]
(This list is certainly incomplete!)
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 50/ 53
Example: Difficulties in Adaptive Settings
• A simple search problem: Find the (only) biased coin using few flips
(Figure: M coins, all with P[heads] = 1/2 except one with P[heads] = 1/2 + ε.)
I Heavy coin V ∈ {1, ..., M} uniformly at random
I Selected coin at time i = 1, ..., n is X_i, observation is Y_i ∈ {0, 1} (1 for heads)
• Non-adaptive setting:
I Since X_i and V are independent, can show I(V; Y_i | X_i) ≲ ε²/M
I Substituting into Fano's inequality gives the requirement n ≳ (M log M)/ε²
• Adaptive setting:
I Nuisance to characterize I(V; Y_i | X_i), as X_i depends on V due to adaptivity!
I Worst-case bounding only gives n ≳ (log M)/ε²
I Next lecture: An alternative tool that gives n ≳ M/ε²
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 51/ 53
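In the non-adaptive setting, the per-flip mutual information can even be computed exactly, since conditioned on X_i = x the outcome is Bernoulli(1/2 + ε/M) marginally. A quick check of the ε²/M scaling (numbers in the example are illustrative):

```python
import math

def h(p):
    """Binary entropy in nats."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log(p) - (1 - p) * math.log(1 - p)

def coin_mutual_info(M, eps):
    """Exact I(V; Y_i | X_i) when the flipped coin X_i is chosen
    independently of the biased coin V (e.g., uniformly): conditioned on
    X_i = x, P[Y_i = 1] = 1/2 + eps/M, so
    I = h(1/2 + eps/M) - [(1/M) h(1/2 + eps) + (1 - 1/M) h(1/2)]."""
    return h(0.5 + eps / M) - ((1.0 / M) * h(0.5 + eps) + (1.0 - 1.0 / M) * h(0.5))

# For small eps this behaves like eps^2 / M (up to constants), matching
# the scaling used on the slide.
mi = coin_mutual_info(M=100, eps=0.1)
```

Concavity of the binary entropy guarantees this quantity is non-negative, and it shrinks as M grows, which is what drives the (M log M)/ε² requirement.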
Conclusion
• Information theory as a theory of data:
(Diagram: information theory at the centre, connected to data generation, storage & transmission, inference & learning, and optimization.)
• Approach highlighted in this talk:
I Reduction to multiple hypothesis testing
I Application of Fano’s inequality
I Bounding the mutual information
• Examples:
I Group testing
I Graphical model selection
I Sparse regression
I Convex optimization
I ...hopefully many more to come!
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 52/ 53
Tutorial Chapter
• Tutorial Chapter: "An Introductory Guide to Fano's Inequality with Applications in Statistical Estimation" [S. and Cevher, 2019]
https://arxiv.org/abs/1901.00555
(To appear in the book Information-Theoretic Methods in Data Science, Cambridge University Press)
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 53/ 53