Fast rates of convergence for learning problems
Prof. John Duchi

Transcript of lecture slides

  • Fast rates of convergence for learning problems

    John Duchi


  • Outline

    I Mean estimation and uniform laws

    II Convexity

    III Growth conditions and fast rates


  • Best rates from uniform laws

    What do our uniform laws give us?

        \hat{h} = \operatorname*{argmin}_{h \in \mathcal{H}} \bigg\{ \hat{L}_n(h) = \frac{1}{n} \sum_{i=1}^n \ell(h; X_i) \bigg\}

    and

        \sup_{h \in \mathcal{H}} \big| \hat{L}_n(h) - L(h) \big| \lesssim \frac{1}{\sqrt{n}}

    with high probability.

    Best this can be? Decompose the excess risk as

        L(\hat{h}) - L(h^\star) = \big( \hat{L}_n(\hat{h}) - \hat{L}_n(h^\star) \big) + \big( L(\hat{h}) - \hat{L}_n(\hat{h}) \big) + \big( \hat{L}_n(h^\star) - L(h^\star) \big)

    (a short completion of this bound follows below).
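    Not on the slide, but to make the decomposition's consequence explicit: \hat{h} minimizes \hat{L}_n, so the first term is nonpositive, and the remaining two terms are controlled by the uniform law:

        L(\hat{h}) - L(h^\star)
          \le \big| L(\hat{h}) - \hat{L}_n(\hat{h}) \big| + \big| \hat{L}_n(h^\star) - L(h^\star) \big|
          \le 2 \sup_{h \in \mathcal{H}} \big| \hat{L}_n(h) - L(h) \big|
          \lesssim \frac{1}{\sqrt{n}}.

    This is the best rate the uniform-law argument alone can give.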

  • The best rate from this approach

    Central limit theorems: Consider the third error term, the one involving h^\star:

        \sqrt{n} \Big( L(h^\star) - \hat{L}_n(h^\star) \Big)
          = \frac{1}{\sqrt{n}} \sum_{i=1}^n \big[ L(h^\star) - \ell(h^\star; X_i) \big]
          \stackrel{d}{\to} \mathcal{N}\big( 0, \mathrm{Var}(\ell(h^\star; X)) \big).

    Is this right?

  • Estimating a mean

    Goal: We want to estimate \theta^\star = \mathbb{E}[X], using the loss

        \ell(\theta; x) = \frac{1}{2} (\theta - x)^2

    with risk

        L(\theta) = \frac{1}{2} \mathbb{E}\big[ (\theta - X)^2 \big]
          = \frac{1}{2} \mathbb{E}\big[ (\theta - \mathbb{E}[X] + \mathbb{E}[X] - X)^2 \big]
          = \frac{1}{2} (\theta - \mathbb{E}[X])^2 + \frac{1}{2} \mathrm{Var}(X).

    Gap in risks: Subtracting, we have

        L(\theta) - L(\theta^\star)
          = \frac{1}{2} (\theta - \theta^\star)^2 + \frac{1}{2} \mathrm{Var}(X) - \frac{1}{2} \mathrm{Var}(X)
          = \frac{1}{2} (\theta - \theta^\star)^2.

    (A small numerical illustration of the resulting 1/n rate follows below.)
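    As an illustration (not from the slides), a minimal Python sketch of the fast rate this gap implies: the excess risk \frac{1}{2}(\hat{\theta}_n - \theta^\star)^2 of the sample mean decays like \sigma^2/(2n) rather than 1/\sqrt{n}. The Gaussian data, seed, and sample sizes are illustrative assumptions.

        import numpy as np

        rng = np.random.default_rng(0)
        sigma, theta_star, reps = 1.0, 0.0, 2000

        for n in [100, 1_000, 10_000]:
            X = rng.normal(theta_star, sigma, size=(reps, n))
            theta_hat = X.mean(axis=1)                    # empirical risk minimizer = sample mean
            excess = 0.5 * (theta_hat - theta_star) ** 2  # L(theta_hat) - L(theta_star)
            print(n, excess.mean())                       # roughly sigma**2 / (2 * n)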

  • Estimating a sub-Gaussian mean

    Let the X_i be independent \sigma^2-sub-Gaussian, so that

        \hat{\theta}_n = \frac{1}{n} \sum_{i=1}^n X_i = \operatorname*{argmin}_\theta \hat{L}_n(\theta),

    and for t \ge 0 we have

        \mathbb{P}\big( |\hat{\theta}_n - \theta^\star| \ge t \big) \le 2 \exp\Big( -\frac{n t^2}{2 \sigma^2} \Big).

    Lemma. With probability at least 1 - \delta, we have

        L(\hat{\theta}_n) - L(\theta^\star) = \frac{1}{2} \big( \hat{\theta}_n - \theta^\star \big)^2 \le \frac{C \sigma^2 \log\frac{1}{\delta}}{n}.

    (A derivation of the lemma from the tail bound follows below.)
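    The derivation is not spelled out on the slide; one way to get the lemma is to invert the tail bound at level \delta. Setting t = \sigma \sqrt{2 \log(2/\delta) / n} makes the right-hand side equal to \delta, so with probability at least 1 - \delta,

        |\hat{\theta}_n - \theta^\star| \le \sigma \sqrt{\frac{2 \log\frac{2}{\delta}}{n}},
        \qquad\text{and hence}\qquad
        L(\hat{\theta}_n) - L(\theta^\star) = \frac{1}{2} \big( \hat{\theta}_n - \theta^\star \big)^2 \le \frac{\sigma^2 \log\frac{2}{\delta}}{n} \le \frac{C \sigma^2 \log\frac{1}{\delta}}{n},

    where the constant C absorbs the difference between \log\frac{2}{\delta} and \log\frac{1}{\delta} (say for \delta \le 1/2).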

  • Uniform law for means?

        \sup_{\theta \in \mathbb{R}} \Big\{ \hat{L}_n(\theta) - L(\theta) \Big\} = +\infty

    (a short calculation follows below).
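    Not on the slide, but worth recording why the supremum is infinite for the mean-estimation loss:

        \hat{L}_n(\theta) - L(\theta)
          = \frac{1}{2n} \sum_{i=1}^n (\theta - X_i)^2 - \frac{1}{2} \mathbb{E}\big[ (\theta - X)^2 \big]
          = \theta \big( \mathbb{E}[X] - \bar{X}_n \big) + \frac{1}{2} \Big( \frac{1}{n} \sum_{i=1}^n X_i^2 - \mathbb{E}[X^2] \Big),

    which is affine in \theta with (almost surely) nonzero slope, so sending \theta to +\infty or -\infty makes the difference arbitrarily large. A uniform law over all of \mathbb{R} cannot hold, which is why the argument that follows localizes around \theta^\star.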

  • Convexity: heuristic and graphical explanation


  • Convexity: definitions

    Definition. A function f : \mathbb{R}^d \to \mathbb{R} is convex if

        f(\lambda u + (1 - \lambda) v) \le \lambda f(u) + (1 - \lambda) f(v)

    for all u, v \in \mathbb{R}^d and \lambda \in [0, 1].

  • Basic properties

    A few properties of convex functions

    - If f : \mathbb{R} \to \mathbb{R} is twice differentiable, then f is convex if and only if f''(t) \ge 0

    - If f : \mathbb{R}^m \to \mathbb{R} is convex and A \in \mathbb{R}^{m \times n}, b \in \mathbb{R}^m, then g(x) := f(Ax + b) is convex

    - If f_1, f_2 are convex, then f_1 + f_2 is convex

    (A worked combination of these properties follows below.)
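    Not on the slide, but a worked combination of the three properties, anticipating the logistic-loss example on the next slide:

        \varphi(t) = \log(1 + e^{-t}) \quad\text{has}\quad \varphi''(t) = \frac{e^{-t}}{(1 + e^{-t})^2} \ge 0,

    so \varphi is convex by the second-derivative criterion; \theta \mapsto y x^T \theta is affine, so \ell(\theta; x, y) = \varphi(y x^T \theta) is convex in \theta by the affine-composition property; and \hat{L}_n(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(\theta; x_i, y_i) is convex because sums (and positive multiples) of convex functions are convex.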

  • Examples

    Example (logistic loss): \varphi(t) = \log(1 + e^{-t}) and \ell(\theta; x, y) = \log(1 + e^{-y x^T \theta}) = \varphi(y x^T \theta).

    Example: Any norm \|\cdot\|.

    Example (\ell_1-regularized linear regression):

        \frac{1}{2n} \| X\theta - y \|_2^2 + \lambda \| \theta \|_1.

    (A small numerical check of the last example appears below.)
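    A minimal Python sketch (not from the slides) of the \ell_1-regularized objective, with a numerical spot-check of midpoint convexity along one random segment; the data dimensions, seed, and \lambda = 0.1 are illustrative assumptions.

        import numpy as np

        def lasso_objective(theta, X, y, lam):
            """(1/(2n)) * ||X theta - y||_2^2 + lam * ||theta||_1, the slide's objective."""
            n = X.shape[0]
            return np.sum((X @ theta - y) ** 2) / (2 * n) + lam * np.sum(np.abs(theta))

        # spot-check midpoint convexity along a single random segment (illustration only)
        rng = np.random.default_rng(1)
        X, y = rng.normal(size=(50, 5)), rng.normal(size=50)
        u, v = rng.normal(size=5), rng.normal(size=5)
        mid = lasso_objective((u + v) / 2, X, y, lam=0.1)
        avg = (lasso_objective(u, X, y, 0.1) + lasso_objective(v, X, y, 0.1)) / 2
        assert mid <= avg + 1e-12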

  • Convex functions have no local minima

    Theorem. Let B = \{\theta : \|\theta\| \le 1\} and S = \{\theta : \|\theta\| = 1\}, and suppose f is convex and satisfies

        f(\theta) \ge f(\theta^\star) \quad\text{for } \theta \in \theta^\star + \epsilon S.

    For \theta \notin \theta^\star + \epsilon B, define

        \theta_\epsilon := \frac{\epsilon}{\|\theta - \theta^\star\|} \theta + \Big( 1 - \frac{\epsilon}{\|\theta - \theta^\star\|} \Big) \theta^\star.

    Then

        f(\theta) - f(\theta^\star) \ge \frac{\|\theta - \theta^\star\|}{\epsilon} \big[ f(\theta_\epsilon) - f(\theta^\star) \big].

  • Proof of theorem

    Note that \theta_\epsilon \in \theta^\star + \epsilon S, so for t = \frac{\epsilon}{\|\theta - \theta^\star\|} \le 1,

        \theta_\epsilon = \frac{\epsilon}{\|\theta - \theta^\star\|} \theta + \Big( 1 - \frac{\epsilon}{\|\theta - \theta^\star\|} \Big) \theta^\star = t \theta + (1 - t) \theta^\star,

    and by convexity

        f(\theta_\epsilon) \le t f(\theta) + (1 - t) f(\theta^\star).

    (The rearrangement that finishes the proof is written out below.)
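    The final rearrangement is not written out on the slide; it is the following:

        f(\theta_\epsilon) - f(\theta^\star) \le t \big( f(\theta) - f(\theta^\star) \big)
        \quad\Longrightarrow\quad
        f(\theta) - f(\theta^\star) \ge \frac{1}{t} \big( f(\theta_\epsilon) - f(\theta^\star) \big) = \frac{\|\theta - \theta^\star\|}{\epsilon} \big[ f(\theta_\epsilon) - f(\theta^\star) \big].

    In particular, if f(\theta_\epsilon) \ge f(\theta^\star) on the sphere \theta^\star + \epsilon S, then f(\theta) \ge f(\theta^\star) for every \theta outside the ball: values on the sphere control values everywhere outside it.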

  • Convex loss functions

    Suppose that we use a convex loss, i.e. \ell(\theta; X) is convex in \theta. Then

        \hat{L}_n(\theta) > \hat{L}_n(\theta^\star) \quad\text{for all } \theta \in \theta^\star + \epsilon S

    implies that

        \hat{\theta}_n = \operatorname*{argmin}_\theta \hat{L}_n(\theta) \quad\text{satisfies}\quad \big\| \hat{\theta}_n - \theta^\star \big\| \le \epsilon.

    (This is the preceding theorem applied to f = \hat{L}_n; see below.)
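    Why this follows (a one-line argument, not spelled out on the slide): apply the previous theorem with f = \hat{L}_n. For any \theta with \|\theta - \theta^\star\| > \epsilon,

        \hat{L}_n(\theta) - \hat{L}_n(\theta^\star) \ge \frac{\|\theta - \theta^\star\|}{\epsilon} \big[ \hat{L}_n(\theta_\epsilon) - \hat{L}_n(\theta^\star) \big] > 0,

    since \theta_\epsilon lies on the sphere \theta^\star + \epsilon S, where \hat{L}_n is strictly larger than at \theta^\star. No point outside the ball can minimize \hat{L}_n, so checking the empirical risk only on the sphere localizes the minimizer.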

  • A picture of how we achieve fast rates


  • Growth and smoothness

    Let us fix some radius r > 0, and assume

        \text{[Growth]} \qquad L(\theta) \ge L(\theta^\star) + \frac{\lambda}{2} \|\theta - \theta^\star\|^2 \quad\text{for } \|\theta - \theta^\star\| \le r,

    and that

        \text{[Smoothness]} \qquad \ell(\cdot\,; x) \text{ is } M\text{-Lipschitz on } \{\theta : \|\theta - \theta^\star\| \le r\}.

    Example (linear regression with bounded x); a sketch of this example appears below.
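    The slide states the example without details; here is a plausible sketch, assuming \theta^\star minimizes the population risk and that \|x\|_2 and |y| are bounded (these assumptions are mine, not the slide's). With \ell(\theta; x, y) = \frac{1}{2}(x^T \theta - y)^2, the optimality condition \nabla L(\theta^\star) = \mathbb{E}[x (x^T \theta^\star - y)] = 0 gives

        L(\theta) - L(\theta^\star) = \frac{1}{2} (\theta - \theta^\star)^T \mathbb{E}[x x^T] (\theta - \theta^\star) \ge \frac{\lambda_{\min}(\mathbb{E}[x x^T])}{2} \|\theta - \theta^\star\|^2,

    so growth holds with \lambda = \lambda_{\min}(\mathbb{E}[x x^T]). For smoothness, \nabla_\theta \ell(\theta; x, y) = x (x^T \theta - y), whose norm on \{\|\theta - \theta^\star\| \le r\} is at most \|x\| \big( \|x\|(\|\theta^\star\| + r) + |y| \big); boundedness of x and y therefore gives a finite Lipschitz constant M on the ball.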

  • Fast rates under growth conditions

    Theorem. Let the growth and smoothness conditions hold, define

        \Theta_\epsilon := \{\theta \in \Theta \mid \|\theta - \theta^\star\| \le \epsilon\}

    and the localized Rademacher complexity

        \mathcal{R}_n(\Theta_\epsilon) := \mathbb{E}\bigg[ \sup_{\theta \in \Theta_\epsilon} \frac{1}{n} \sum_{i=1}^n \varepsilon_i \big[ \ell(\theta; X_i) - \ell(\theta^\star; X_i) \big] \bigg].

    Fix t \ge 0 and choose any \epsilon \le r such that

        \frac{\lambda \epsilon^2}{2} \ge 2 \mathcal{R}_n(\Theta_\epsilon) + \frac{\sqrt{2} M}{\sqrt{n}} t \cdot \epsilon.

    Then

        \mathbb{P}\Big( \big\| \hat{\theta} - \theta^\star \big\| \ge \epsilon \Big) \le 2 e^{-n t^2}.

    (How an explicit rate falls out of this condition is sketched below.)
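    A sketch (not from the slides) of how an explicit rate follows from the theorem, assuming a localized complexity bound of the form \mathcal{R}_n(\Theta_\epsilon) \le c M \sqrt{d}\, \epsilon / \sqrt{n} (this form is an assumption here, though it matches the multiclass example later). Both terms on the right of the condition scale linearly in \epsilon, so dividing through by \epsilon the condition becomes

        \frac{\lambda \epsilon}{2} \ge \frac{2 c M \sqrt{d}}{\sqrt{n}} + \frac{\sqrt{2} M t}{\sqrt{n}},

    which holds as soon as \epsilon \ge \frac{C M}{\lambda \sqrt{n}} \big( \sqrt{d} + t \big) for a suitable numerical constant C. Taking the smallest such \epsilon (provided it is at most r) in the theorem yields the bound on the next slide.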

  • Bounding the local Rademacher complexity

    Under the conditions of the theorem, for \Theta \subset \mathbb{R}^d we have

        \big\| \hat{\theta} - \theta^\star \big\| \le \frac{C M}{\lambda \sqrt{n}} \Big( \sqrt{d} + t \Big) \qquad\text{w.p.} \ge 1 - 2 e^{-n t^2}.

  • Bounding the local Rademacher complexity

    Under the conditions of the theorem, for \Theta \subset \mathbb{R}^d we have

        L(\hat{\theta}) - L(\theta^\star) \le O(\text{stuff}) \cdot \frac{\log\frac{1}{\delta}}{n} \qquad\text{w.p.} \ge 1 - \delta.

    (A rough sketch of where this comes from follows below.)
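    Not on the slide, a rough sketch of the excess-risk rate under the same illustrative assumption \mathcal{R}_n(\Theta_\epsilon) \lesssim M \sqrt{d}\, \epsilon / \sqrt{n}. On the event \|\hat{\theta} - \theta^\star\| \le \epsilon with \epsilon \asymp \frac{M}{\lambda \sqrt{n}} (\sqrt{d} + t),

        L(\hat{\theta}) - L(\theta^\star)
          \le \underbrace{\hat{L}_n(\hat{\theta}) - \hat{L}_n(\theta^\star)}_{\le 0}
            + \sup_{\theta \in \Theta_\epsilon} \Big| \hat{L}_n(\theta) - L(\theta) - \big( \hat{L}_n(\theta^\star) - L(\theta^\star) \big) \Big|
          \lesssim \frac{M (\sqrt{d} + t)\, \epsilon}{\sqrt{n}}
          \asymp \frac{M^2 (\sqrt{d} + t)^2}{\lambda n},

    a 1/n rate; choosing t so that the failure probability is at most \delta produces the \log\frac{1}{\delta} factor, with the remaining problem constants (M, d, \lambda) absorbed into the "stuff".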

  • Multiclass classification

    Suppose we have the multiclass logistic loss for \theta = [\theta_1 \cdots \theta_k], \theta_l \in \mathbb{R}^d, y \in \{1, \ldots, k\}, \|x\|_2 \le M:

        \ell(\theta; x, y) = \log\bigg( \sum_{l=1}^k \exp\big( x^T (\theta_l - \theta_y) \big) \bigg).

    Then

        \mathcal{R}_n(\Theta_\epsilon) \lesssim \frac{M \sqrt{dk}}{\sqrt{n}}\, \epsilon.

    (A small implementation of this loss appears below.)
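    For concreteness, a minimal Python implementation of the displayed loss (not from the slides); representing \theta as a d-by-k matrix with columns \theta_l is an assumption about layout.

        import numpy as np

        def multiclass_logistic_loss(theta, x, y):
            """log(sum_l exp(x^T (theta_l - theta_y))); theta has shape (d, k) with columns theta_l."""
            scores = x @ theta            # shape (k,): x^T theta_l for each class l
            z = scores - scores[y]        # x^T (theta_l - theta_y)
            zmax = z.max()                # numerically stabilized log-sum-exp
            return zmax + np.log(np.exp(z - zmax).sum())

        # example call on random data (illustrative only)
        rng = np.random.default_rng(0)
        d, k = 5, 3
        theta, x = rng.normal(size=(d, k)), rng.normal(size=d)
        print(multiclass_logistic_loss(theta, x, y=1))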

  • Proof of theorem

    Part 1: Consider the event that \hat{L}_n(\theta) \le \hat{L}_n(\theta^\star) for some \theta \in \Theta_\epsilon with \|\theta - \theta^\star\| = \epsilon, which implies

        \hat{L}_n(\theta) - L(\theta) \le \hat{L}_n(\theta^\star) - L(\theta^\star) - \frac{\lambda}{2} \epsilon^2.

    (The one-line derivation is written out below.)
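    The implication uses the growth condition; written out (this step is not on the slide):

        \hat{L}_n(\theta) - L(\theta)
          \le \hat{L}_n(\theta^\star) - L(\theta)
          \le \hat{L}_n(\theta^\star) - L(\theta^\star) - \frac{\lambda}{2} \|\theta - \theta^\star\|^2
          = \hat{L}_n(\theta^\star) - L(\theta^\star) - \frac{\lambda}{2} \epsilon^2,

    where the first inequality uses \hat{L}_n(\theta) \le \hat{L}_n(\theta^\star) and the second uses growth. So on this event the centered empirical process at some \theta \in \Theta_\epsilon deviates from its value at \theta^\star by at least \frac{\lambda}{2} \epsilon^2.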

  • Proof of theorem

    Part 2: Consider the localized excess-risk process, for \theta \in \Theta_\epsilon,

        \sum_{i=1}^n \Big[ \big( \ell(\theta; X_i) - \ell(\theta^\star; X_i) \big) - \big( L(\theta) - L(\theta^\star) \big) \Big],

    and get (for all t \ge 0)

        \mathbb{P}\bigg( \sup_{\theta \in \Theta_\epsilon} \Big| \hat{L}_n(\theta) - L(\theta) - \big( \hat{L}_n(\theta^\star) - L(\theta^\star) \big) \Big| \ge 2 \mathcal{R}_n(\Theta_\epsilon) + t \bigg)
          \le 2 \exp\Big( -\frac{n t^2}{2 M^2 \epsilon^2} \Big).

    (The standard ingredients behind this bound are sketched below.)
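    Not detailed on the slide, the two standard ingredients behind the display: by symmetrization, the expected supremum is at most 2 \mathcal{R}_n(\Theta_\epsilon); and since \ell(\cdot\,; x) is M-Lipschitz on the ball, each summand satisfies |\ell(\theta; X_i) - \ell(\theta^\star; X_i)| \le M \|\theta - \theta^\star\| \le M \epsilon, so changing a single X_i moves the (normalized) supremum by at most 2 M \epsilon / n, and the bounded-differences (McDiarmid) inequality gives the tail \exp\big( - n t^2 / (2 M^2 \epsilon^2) \big), with the leading factor 2 covering both tails.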

  • Proof of theorem

    Part 3: Implications. If \big\| \hat{\theta} - \theta^\star \big\| \ge \epsilon, then

        \hat{L}_n(\theta) - \hat{L}_n(\theta^\star) \le 0 \quad\text{for some } \theta \in \Theta_\epsilon \text{ with } \|\theta - \theta^\star\| = \epsilon

    (by convexity, taking the point on the segment from \theta^\star to \hat{\theta} at distance \epsilon), which by Part 1 gives

        \sup_{\theta \in \Theta_\epsilon} \Big| \hat{L}_n(\theta) - L(\theta) - \big( \hat{L}_n(\theta^\star) - L(\theta^\star) \big) \Big| \ge \frac{\lambda \epsilon^2}{2},

    and hence, by the choice of \epsilon in the theorem,

        \sup_{\theta \in \Theta_\epsilon} \Big| \hat{L}_n(\theta) - L(\theta) - \big( \hat{L}_n(\theta^\star) - L(\theta^\star) \big) \Big| \ge 2 \mathcal{R}_n(\Theta_\epsilon) + \frac{\sqrt{2} M}{\sqrt{n}} t \cdot \epsilon.

    Part 2 bounds the probability of this last event, which gives the theorem.

  • Reading and bibliography

    1. P. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497–1537, 2005.

    2. A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York, 1996 (Ch. 3.2–3.4).

    3. S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005 (§5.3).

    4. A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM and Mathematical Programming Society, 2009 (Ch. 5.3).