
OOPS 2020 · Mean field methods in high-dimensional statistics and nonconvex optimization
Lecturer: Andrea Montanari · Problem session leader: Michael Celentano
July 7, 2020

Problem Session 1

Problem 1: from Gordon’s objective to the fixed point equations

Recall Gordon’s min-max problem is

$$B_*(g,h) := \min_{u \in \mathbb{R}^d} \max_{v \in \mathbb{R}^n} \left\{ \frac{1}{n}\|v\|\langle g, u\rangle + \frac{1}{n}\|u\|\langle h, v\rangle - \frac{\sigma}{n}\langle w, v\rangle - \frac{1}{2n}\|v\|^2 + \frac{\lambda}{\sqrt{n}}\|\theta_0 + u\|_1 \right\}. \tag{1}$$

In lecture, we claimed that by analyzing Gordon’s objective we can show that the Lasso solution is described in terms of the solutions $\tau_*, \beta_*$ to the fixed point equations

$$\tau^2 = \sigma^2 + \frac{1}{\delta}\,\mathbb{E}\!\left[\big(\eta(\Theta + \tau Z;\, \tau\lambda/\beta) - \Theta\big)^2\right], \qquad \beta = \tau\left(1 - \frac{1}{\delta}\,\mathbb{E}\big[\eta'(\Theta + \tau Z;\, \tau\lambda/\beta)\big]\right),$$

where $\eta$ is the solution of the 1-dimensional problem
$$\eta(y; \alpha) := \arg\min_{x \in \mathbb{R}} \left\{ \frac{1}{2}(y - x)^2 + \alpha|x| \right\} = (|y| - \alpha)_+ \operatorname{sign}(y).$$

$\eta$ is commonly known as soft-thresholding. In this problem, we will outline how to derive the fixed point equations from Gordon’s min-max problem.
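For concreteness, here is a minimal numerical sketch of $\eta$ (in Python/NumPy; the function name `eta` and the grid check are ours, not from the lecture), with a brute-force verification that it solves the 1-dimensional problem above.

```python
import numpy as np

def eta(y, alpha):
    """Soft-thresholding: eta(y; alpha) = (|y| - alpha)_+ * sign(y)."""
    return np.maximum(np.abs(y) - alpha, 0.0) * np.sign(y)

# Brute-force check that eta(y; alpha) minimizes x -> (y - x)^2 / 2 + alpha * |x|.
rng = np.random.default_rng(0)
for _ in range(5):
    y, alpha = rng.normal(), rng.uniform(0.1, 2.0)
    xs = np.linspace(-5.0, 5.0, 200_001)
    obj = 0.5 * (y - xs) ** 2 + alpha * np.abs(xs)
    assert abs(xs[np.argmin(obj)] - eta(y, alpha)) < 1e-3
```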


(a) Define $\tilde\theta_0 := \sqrt{n}\,\theta_0$ and $\tilde u := \sqrt{n}\,u$; below we drop the tildes and simply write $\theta_0, u$ for the rescaled quantities. Prove that $B_*(g,h)$ has the same distribution as
$$\min_{u \in \mathbb{R}^d} \max_{\beta \ge 0} \left\{ \left( \left\| \frac{\|u\|}{\sqrt{n}} \frac{h}{\sqrt{n}} - \frac{\sigma w}{\sqrt{n}} \right\| - \frac{1}{n}\langle g, u\rangle \right)\beta - \frac{1}{2}\beta^2 + \frac{\lambda}{n}\|\theta_0 + u\|_1 \right\}. \tag{2}$$

Hint: Let $\beta = \|v\|/\sqrt{n}$.
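As a quick numerical sanity check of this reduction (a sketch, not part of the official solution): for fixed $u$, the inner maximization over $v$ in Eq. (1) can be solved numerically from a random start, and the answer should match the explicit maximization over the direction of $v$ followed by the maximization over $\beta$. The check keeps the $+\frac{1}{n}\langle g,u\rangle$ sign from Eq. (1), in the original (unrescaled) variables; the sign flip in Eq. (2) costs nothing in distribution because $g$ and $-g$ have the same law.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, d, sigma = 6, 4, 0.7
g, u = rng.normal(size=d), rng.normal(size=d)
h, w = rng.normal(size=n), rng.normal(size=n)

def neg_inner(v):
    # Minus the bracket of Eq. (1) at fixed u, as a function of v.
    return -(np.linalg.norm(v) * (g @ u) / n + np.linalg.norm(u) * (h @ v) / n
             - sigma * (w @ v) / n - (v @ v) / (2 * n))

lhs = max(-minimize(neg_inner, rng.normal(size=n)).fun for _ in range(5))

# Maximizing over the direction of v and setting beta = ||v|| / sqrt(n) leaves
# max_{beta >= 0} { c * beta - beta^2 / 2 } = (c_+)^2 / 2, with
c = (np.linalg.norm(np.linalg.norm(u) * h - sigma * w) + g @ u) / np.sqrt(n)
rhs = 0.5 * max(c, 0.0) ** 2
print(lhs, rhs)  # agree up to optimizer tolerance
```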


(b) Argue (heuristically) that we may approximate the optimization above by
$$\min_{u \in \mathbb{R}^d} \max_{\beta \ge 0} \left\{ \left( \sqrt{\frac{\|u\|^2}{n} + \sigma^2} - \frac{1}{n}\langle g, u\rangle \right)\beta - \frac{1}{2}\beta^2 + \frac{\lambda}{n}\|\theta_0 + u\|_1 \right\}. \tag{3}$$

Remark: If we maximize over $\beta \ge 0$ explicitly, we get
$$\min_{u \in \mathbb{R}^d} \left\{ \frac{1}{2}\left( \sqrt{\frac{\|u\|^2}{n} + \sigma^2} - \frac{\langle g, u\rangle}{n} \right)_+^2 + \frac{\lambda}{n}\|\theta_0 + u\|_1 \right\}.$$
Note that the objective on the right-hand side is locally strongly convex around any point $u$ at which the first term is positive. When $n < p$, the Lasso objective is nowhere locally strongly convex. This convenient feature of the new form of Gordon’s problem is very useful for its analysis. We do not explore this further here.
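The approximation in part (b) is a law-of-large-numbers heuristic: $\|h\|^2/n \to 1$, $\langle h, w\rangle/n \to 0$, and (assuming, consistently with the $\sigma$ scaling, that $\|w\|^2/n \to 1$, e.g. $w \sim N(0, I_n)$) the norm in Eq. (2) concentrates around $\sqrt{\|u\|^2/n + \sigma^2}$. A quick Monte Carlo illustration (a sketch; `r` plays the role of $\|u\|/\sqrt{n}$):

```python
import numpy as np

rng = np.random.default_rng(2)
r, sigma = 1.3, 0.7  # r plays the role of ||u|| / sqrt(n)
for n in (100, 10_000, 1_000_000):
    h = rng.normal(size=n)
    w = rng.normal(size=n)  # w ~ N(0, I_n): an assumption for this demo
    val = np.linalg.norm((r * h - sigma * w) / np.sqrt(n))
    print(n, val, np.sqrt(r**2 + sigma**2))  # val -> sqrt(r^2 + sigma^2)
```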


(c) Argue that the quantity in Eq. (3) is equal to
$$\max_{\beta \ge 0} \min_{\tau \ge 0} \left\{ \frac{\sigma^2\beta}{2\tau} + \frac{\tau\beta}{2} - \frac{\beta^2}{2} + \frac{1}{n}\min_{u \in \mathbb{R}^d} \left\{ \frac{\beta}{2\tau}\|u\|^2 - \beta\langle g, u\rangle + \lambda\|\theta_0 + u\|_1 \right\} \right\}. \tag{4}$$

Hint: Recall the identity $\sqrt{x} = \min_{\tau \ge 0} \left\{ \frac{x}{2\tau} + \frac{\tau}{2} \right\}$.
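The hint is the standard variational form of the square root, with minimizer $\tau = \sqrt{x}$; substituting it for the square root in Eq. (3) decouples $\|u\|^2$ from the rest. A two-line numerical check, for what it is worth (the grid is ours):

```python
import numpy as np

x = 2.7
taus = np.linspace(1e-3, 10.0, 100_000)
vals = x / (2 * taus) + taus / 2
print(vals.min(), np.sqrt(x), taus[np.argmin(vals)])  # min = sqrt(x), attained at tau = sqrt(x)
```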


(d) Write
$$\hat u := \arg\min_{u \in \mathbb{R}^d} \left\{ \frac{\beta}{2\tau}\|u\|^2 - \beta\langle g, u\rangle + \lambda\|\theta_0 + u\|_1 \right\}$$
in terms of the soft-thresholding operator.
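Completing the square coordinate-by-coordinate reduces this to the 1-dimensional problem defining $\eta$. The sketch below states a candidate closed form (treat it as a claim to verify; deriving it is the content of this part) and checks it against a direct minimization on a grid, exploiting the fact that the objective separates across coordinates.

```python
import numpy as np

def eta(y, alpha):
    return np.maximum(np.abs(y) - alpha, 0.0) * np.sign(y)

rng = np.random.default_rng(3)
d, beta, tau, lam = 8, 1.1, 0.9, 0.5
g, theta0 = rng.normal(size=d), rng.normal(size=d)

# Candidate closed form: u_hat_i = eta(theta0_i + tau * g_i; tau * lam / beta) - theta0_i.
u_closed = eta(theta0 + tau * g, tau * lam / beta) - theta0

# Direct check: the objective separates across coordinates, so minimize each on a grid.
xs = np.linspace(-10.0, 10.0, 200_001)
u_grid = np.array([
    xs[np.argmin(beta / (2 * tau) * xs**2 - beta * gi * xs + lam * np.abs(t0i + xs))]
    for gi, t0i in zip(g, theta0)
])
print(np.max(np.abs(u_grid - u_closed)))  # ~ grid resolution
```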

(e) Compute
(i) the derivative of $\min_{u \in \mathbb{R}^d} \left\{ \frac{\beta}{2\tau}\|u\|^2 - \beta\langle g, u\rangle + \lambda\|\theta_0 + u\|_1 \right\}$ with respect to $\tau$;
(ii) the derivative of $\min_{u \in \mathbb{R}^d} \left\{ \frac{\beta}{2\tau}\|u\|^2 - \beta\langle g, u\rangle + \lambda\|\theta_0 + u\|_1 \right\}$ with respect to $\beta$.
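The envelope (Danskin) principle is the natural tool here: the derivative of the minimum value with respect to a parameter equals the partial derivative of the objective evaluated at the minimizer $\hat u$. Below is a finite-difference sanity check of that principle on this objective (a sketch; it uses the soft-thresholding closed form for $\hat u$ from part (d), and the two displayed formulas are what the envelope principle predicts, which this part asks you to derive carefully).

```python
import numpy as np

def eta(y, alpha):
    return np.maximum(np.abs(y) - alpha, 0.0) * np.sign(y)

rng = np.random.default_rng(4)
d, lam = 8, 0.5
g, theta0 = rng.normal(size=d), rng.normal(size=d)

def u_hat(tau, beta):
    return eta(theta0 + tau * g, tau * lam / beta) - theta0

def m(tau, beta):  # the minimum over u, via the closed form from part (d)
    u = u_hat(tau, beta)
    return beta / (2 * tau) * (u @ u) - beta * (g @ u) + lam * np.abs(theta0 + u).sum()

tau, beta, eps = 0.9, 1.1, 1e-6
u = u_hat(tau, beta)
print((m(tau + eps, beta) - m(tau - eps, beta)) / (2 * eps),
      -beta / (2 * tau**2) * (u @ u))                  # envelope prediction for d/d tau
print((m(tau, beta + eps) - m(tau, beta - eps)) / (2 * eps),
      (u @ u) / (2 * tau) - g @ u)                     # envelope prediction for d/d beta
```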


(f) Write $\frac{1}{n}\mathbb{E}\big[\|\hat u\|^2\big]$ and $\frac{1}{n}\mathbb{E}\big[\langle g, \hat u\rangle\big]$ as an expectation over the random variables $(\Theta, Z) \sim \hat\mu_{\theta_0} \otimes N(0,1)$. For the latter, rewrite it using Gaussian integration by parts.
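Gaussian integration by parts is Stein’s lemma, $\mathbb{E}[Z f(Z)] = \mathbb{E}[f'(Z)]$ for $Z \sim N(0,1)$ and suitable $f$; applied with $f(z) = \eta(\Theta + \tau z;\, \alpha)$, it trades the $\langle g, \hat u\rangle$ term for the derivative $\eta'$. A quick Monte Carlo check (a sketch; recall $\eta'(y;\alpha) = \mathbf{1}\{|y| > \alpha\}$ away from the kink):

```python
import numpy as np

def eta(y, alpha):
    return np.maximum(np.abs(y) - alpha, 0.0) * np.sign(y)

rng = np.random.default_rng(5)
theta, tau, alpha = 0.8, 0.9, 0.6
Z = rng.normal(size=2_000_000)
lhs = np.mean(Z * eta(theta + tau * Z, alpha))
rhs = tau * np.mean(np.abs(theta + tau * Z) > alpha)  # tau * E[eta'(theta + tau Z; alpha)]
print(lhs, rhs)  # agree up to Monte Carlo error
```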

(g) Take the derivative of the objective in Eq. (4) with respect to $\tau$ and with respect to $\beta$. Show that setting the expectations of these derivatives to 0 is equivalent to the fixed point equations.
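Once the fixed point equations are in hand, it is natural to solve them numerically. Below is a minimal damped fixed-point iteration under an assumed three-point prior for $\Theta$ (the prior, damping, and parameter values are our choices for illustration, not from the lecture), with the expectations evaluated by Monte Carlo.

```python
import numpy as np

def eta(y, alpha):
    return np.maximum(np.abs(y) - alpha, 0.0) * np.sign(y)

rng = np.random.default_rng(6)
sigma, delta, lam = 0.5, 1.25, 0.3  # problem parameters (illustrative choices)
eps, mu = 0.2, 3.0                  # assumed prior: Theta = 0 w.p. 1 - eps, +/-mu w.p. eps/2 each
N = 2_000_000
Theta = rng.choice([0.0, mu, -mu], p=[1 - eps, eps / 2, eps / 2], size=N)
Z = rng.normal(size=N)

tau, beta = 1.0, 1.0
for _ in range(200):
    alpha = tau * lam / beta
    Y = Theta + tau * Z
    tau_new = np.sqrt(sigma**2 + np.mean((eta(Y, alpha) - Theta) ** 2) / delta)
    beta_new = tau * (1 - np.mean(np.abs(Y) > alpha) / delta)
    tau, beta = 0.5 * tau + 0.5 * tau_new, 0.5 * beta + 0.5 * beta_new  # damped update
print(tau, beta)
```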


Problem 2: Gordon’s objective for max-margin classification

The use of Gordon’s technique extends well beyond linear models. In logistic regression, we receive i.i.d. samples according to
$$y_i \sim \operatorname{Rad}\big(f(\langle x_i, \theta_0\rangle)\big), \qquad x_i \sim N(0, I_d), \qquad f(x) = \frac{\exp(x)}{\exp(x) + \exp(-x)}.$$

For a certain $\delta_* > 0$ the following occurs: when $n/p \to \delta < \delta_*$ as $n, p \to \infty$, with high probability there exists $\theta$ such that $y_i\langle x_i, \theta\rangle > 0$ for all $i = 1, \ldots, n$. In such a regime, the data is linearly separable. The max-margin classifier is defined as
$$\hat\theta \in \arg\max_\theta \left\{ \min_{i \le n} y_i\langle x_i, \theta\rangle \,:\, \|\theta\| \le 1 \right\}, \tag{5}$$
and the value of the optimization problem, which we denote by $\kappa^*(y, X)$, is called the maximum margin. To simplify notation, we assume in this problem that $\|\theta_0\| = 1$. In this problem, we outline how to set up Gordon’s problem for max-margin classification. The analysis of Gordon’s objective is complicated, and we do not describe it here. See Montanari, Ruan, Sohn, Yan (2019+), “The generalization error of max-margin linear classifiers: High-dimensional asymptotics in the overparameterized regime,” arXiv:1911.01544.
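To experiment with this setting, here is a sketch that draws data from the logistic model above and computes the maximum margin; it uses the standard equivalence $\kappa^*(y, X) = 1/\|\hat w\|$, where $\hat w$ solves the hard-margin program $\min \|w\|^2$ subject to $y_i\langle x_i, w\rangle \ge 1$ (the solver choice and problem sizes are ours).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
n, d = 50, 100  # n/d < 1, so the data is separable with high probability
theta0 = np.zeros(d)
theta0[0] = 1.0  # ||theta0|| = 1
X = rng.normal(size=(n, d))
p = 1.0 / (1.0 + np.exp(-2 * X @ theta0))  # f(x) = e^x / (e^x + e^{-x}) = 1 / (1 + e^{-2x})
y = np.where(rng.uniform(size=n) < p, 1.0, -1.0)

# Hard-margin program: kappa*(y, X) = 1 / ||w_hat||, w_hat = argmin ||w||^2 s.t. y_i <x_i, w> >= 1.
res = minimize(lambda w: w @ w, np.zeros(d), jac=lambda w: 2 * w, method="SLSQP",
               constraints=[{"type": "ineq",
                             "fun": lambda w: y * (X @ w) - 1.0,
                             "jac": lambda w: y[:, None] * X}])
kappa = 1.0 / np.linalg.norm(res.x) if res.success else 0.0  # 0 if not separable
print(kappa)
```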


(a) Show that $\kappa^*(y, X) \ge \kappa$ if and only if
$$\min_{\|\theta\| \le 1} \big\| \big( \kappa\mathbf{1} - y \odot X\theta \big)_+ \big\|_2 = 0.$$
Argue that
$$\min_{\|\theta\| \le 1} \frac{1}{\sqrt{d}} \big\| \big( \kappa\mathbf{1} - y \odot X\theta \big)_+ \big\|_2 = \min_{\|\theta\| \le 1} \max_{\|\lambda\| \le 1,\, \lambda \ge 0} \frac{1}{\sqrt{d}} \lambda^\top \big( \kappa\mathbf{1} - y \odot X\theta \big) = \min_{\|\theta\| \le 1} \max_{\|\lambda\| \le 1,\, \lambda \odot y \ge 0} \frac{1}{\sqrt{d}} \lambda^\top \big( \kappa y - X\theta \big).$$
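The first equality rests on the dual representation $\|(z)_+\|_2 = \max\{\langle\lambda, z\rangle : \|\lambda\| \le 1,\ \lambda \ge 0\}$, with maximizer $\lambda \propto (z)_+$. A small numerical sketch confirming it (the random-search scheme is ours):

```python
import numpy as np

rng = np.random.default_rng(8)
z = rng.normal(size=10)
target = np.linalg.norm(np.maximum(z, 0.0))

best = 0.0
for _ in range(100_000):
    lam = np.abs(rng.normal(size=10))
    lam /= max(np.linalg.norm(lam), 1.0)  # feasible: lam >= 0 and ||lam|| <= 1
    best = max(best, lam @ z)
print(best, target)  # random search never exceeds the target ...

zp = np.maximum(z, 0.0)
print(zp / np.linalg.norm(zp) @ z)  # ... and lam* = (z)_+ / ||(z)_+|| attains it exactly
```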

(b) Why can’t we use Gordon’s inequality to compare the preceding min-max problem to
$$\min_{\|\theta\| \le 1} \max_{\|\lambda\| \le 1,\, y \odot \lambda \ge 0} \frac{1}{\sqrt{d}} \big( \kappa\lambda^\top y + \|\lambda\| g^\top\theta + \|\theta\| h^\top\lambda \big)?$$


(c) Let $x = X\theta_0$. Show that the min-max problem is equivalent to
$$\min_{\|\theta\| \le 1} \max_{\|\lambda\| \le 1,\, \lambda \ge 0} \frac{1}{\sqrt{d}} \lambda^\top \big( \kappa\mathbf{1} - (y \odot x)\langle\theta_0, \theta\rangle - y \odot X\Pi_{\theta_0^\perp}\theta \big).$$
Here $\Pi_{\theta_0^\perp}$ is the projection operator onto the orthogonal complement of the space spanned by $\theta_0$. Argue that
$$\mathbb{P}\left( \min_{\|\theta\| \le 1} \max_{\|\lambda\| \le 1,\, \lambda \ge 0} \frac{1}{\sqrt{d}} \lambda^\top \big( \kappa\mathbf{1} - (y \odot x)\langle\theta_0, \theta\rangle - y \odot X\Pi_{\theta_0^\perp}\theta \big) \le t \right) \le 2\, \mathbb{P}\left( \min_{\|\theta\| \le 1} \max_{\|\lambda\| \le 1,\, \lambda \ge 0} \frac{1}{\sqrt{d}} \Big( \lambda^\top \big( \kappa\mathbf{1} - (y \odot x)\langle\theta_0, \theta\rangle \big) + \|\lambda\|\, g^\top \Pi_{\theta_0^\perp}\theta + \|\Pi_{\theta_0^\perp}\theta\|\, h^\top\lambda \Big) \le t \right),$$
and likewise for the comparison of the probabilities that the min-max values exceed $t$, where $g \sim N(0, I_d)$ and $h \sim N(0, I_n)$ are independent of everything else.
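The point of the decomposition is that $X\Pi_{\theta_0^\perp}$ is independent of $y$ (which depends on $X$ only through $x = X\theta_0$), so Gordon’s inequality can be applied conditionally on $(y, x)$. A sketch checking the underlying algebra $y \odot X\theta = (y \odot x)\langle\theta_0, \theta\rangle + y \odot X\Pi_{\theta_0^\perp}\theta$ when $\|\theta_0\| = 1$:

```python
import numpy as np

rng = np.random.default_rng(9)
n, d = 7, 5
X, theta = rng.normal(size=(n, d)), rng.normal(size=d)
theta0 = rng.normal(size=d)
theta0 /= np.linalg.norm(theta0)  # ||theta0|| = 1
y = np.where(rng.uniform(size=n) < 0.5, 1.0, -1.0)

x = X @ theta0
P_perp = np.eye(d) - np.outer(theta0, theta0)  # projection onto the complement of span(theta0)
lhs = y * (X @ theta)
rhs = (y * x) * (theta0 @ theta) + y * (X @ P_perp @ theta)
print(np.max(np.abs(lhs - rhs)))  # ~ 0
```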

(d) What is the limit in Wasserstein-2 distance of $\frac{1}{n}\sum_{i=1}^n \delta_{(y_i, x_i, h_i)}$?
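Without giving the answer away, here is a sketch that samples the empirical measure and compares a few of its moments against a candidate product-form law one might conjecture from the model (the candidate below is a guess to be tested, not the stated answer).

```python
import numpy as np

rng = np.random.default_rng(10)
n, d = 100_000, 10
theta0 = np.zeros(d)
theta0[0] = 1.0  # ||theta0|| = 1
Xmat = rng.normal(size=(n, d))
x = Xmat @ theta0                          # coordinates of x = X theta0
y = np.where(rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-2 * x)), 1.0, -1.0)
h = rng.normal(size=n)                     # coordinates of Gordon's h ~ N(0, I_n)

# Candidate limit: (Y, X, H) with X ~ N(0,1), Y | X ~ Rad(f(X)), H ~ N(0,1) independent.
Xc = rng.normal(size=n)
Yc = np.where(rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-2 * Xc)), 1.0, -1.0)
Hc = rng.normal(size=n)

for name, emp, cand in [("E[YX]", y * x, Yc * Xc), ("E[YH]", y * h, Yc * Hc),
                        ("E[XH]", x * h, Xc * Hc)]:
    print(name, emp.mean(), cand.mean())
```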