Adaptive Subgradient Methods
for
Online Learning and Stochastic Optimization
John C. Duchi¹,²   Elad Hazan³   Yoram Singer²
¹University of California, Berkeley
²Google Research
³Technion
International Symposium on Mathematical Programming 2012
Duchi et al. (UC Berkeley) Adaptive Subgradient Methods ISMP 2012 1 / 32
Setting: Online Convex Optimization

Online learning task—repeat:
• Learner plays point x_t
• Receive function f_t
• Suffer loss f_t(x_t) + ϕ(x_t)

Running example (sparse logistic regression):
• x_t is a parameter vector for features
• Receive label y_t, features φ_t
• Suffer regularized logistic loss

    log[1 + exp(−y_t ⟨φ_t, x_t⟩)] + λ‖x_t‖_1

Goal: attain small regret

    ∑_{t=1}^T [f_t(x_t) + ϕ(x_t)] − inf_{x∈X} ∑_{t=1}^T [f_t(x) + ϕ(x)]
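As a concrete sketch of the running example's per-round loss (a minimal illustration of the formula above; the function name is ours, not from the talk):

```python
import math

def reg_logistic_loss(x, phi, y, lam):
    """Per-round loss: log(1 + exp(-y * <phi, x>)) + lam * ||x||_1."""
    margin = y * sum(p * w for p, w in zip(phi, x))
    return math.log1p(math.exp(-margin)) + lam * sum(abs(w) for w in x)
```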
Motivation

Text data:

    "The most unsung birthday in American business and technological
    history this year may be the 50th anniversary of the Xerox 914
    photocopier." —The Atlantic, July/August 2010

High-dimensional image features

Other motivations: selecting advertisements in online advertising,
document ranking, and problems with parameterizations spanning many
magnitudes...
Goal? Flipping around the usual sparsity game

    min_x ‖Ax − b‖,   A = [a_1 a_2 ⋯ a_n]^⊤ ∈ ℝ^{n×d}

Bounds for sparsity-focused methods usually depend on

    ‖a_i‖_∞ (dense) · ‖x‖_1 (sparse)

What we would like:

    ‖a_i‖_1 (sparse) · ‖x‖_∞ (dense)

(In general, impossible)
Approaches: Gradient Descent and Dual Averaging

Let g_t ∈ ∂f_t(x_t):

    x_{t+1} = argmin_{x∈X} { (1/2)‖x − x_t‖² + η_t ⟨g_t, x⟩ }

or

    x_{t+1} = argmin_{x∈X} { (η_t/t) ∑_{τ=1}^t ⟨g_τ, x⟩ + (1/(2t))‖x‖² }

[Figure: f(x) and its linearization f(x_t) + ⟨g_t, x − x_t⟩]
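In the unconstrained case X = ℝ^d, both updates above have simple closed forms; a minimal sketch (our own illustration, not code from the talk):

```python
import numpy as np

def gd_step(x_t, g_t, eta_t):
    # Minimizer of 0.5*||x - x_t||^2 + eta_t*<g_t, x> over x in R^d.
    return x_t - eta_t * g_t

def dual_averaging_step(grad_sum, eta_t):
    # Minimizer of (eta_t/t)*<grad_sum, x> + (1/(2t))*||x||^2 over
    # x in R^d; the 1/t factors cancel in the first-order condition.
    return -eta_t * grad_sum
```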
What is the problem?
• Gradient steps treat all features as equal
• They are not!
Adapting to Geometry of Space
Why adapt to geometry?

[Figure: level sets of a hard (elongated) geometry vs. a nice (round) geometry]

    y_t   φ_{t,1}  φ_{t,2}  φ_{t,3}
     1      1        0        0
    -1      .5       0        1
     1     -.5       1        0
    -1      0        0        0
     1      .5       0        0
    -1      1        0        0
     1     -1        1        0
    -1     -.5       0        1

• Feature 1: frequent, irrelevant
• Feature 2: infrequent, predictive
• Feature 3: infrequent, predictive
Adapting to Geometry of the Space

• Receive g_t ∈ ∂f_t(x_t)
• Earlier:

    x_{t+1} = argmin_{x∈X} { (1/2)‖x − x_t‖² + η ⟨g_t, x⟩ }

• Now: let ‖x‖²_A = ⟨x, Ax⟩ for A ≻ 0. Use

    x_{t+1} = argmin_{x∈X} { (1/2)‖x − x_t‖²_A + η ⟨g_t, x⟩ }
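For X = ℝ^d, the A-metric update is a preconditioned gradient step x_{t+1} = x_t − η A⁻¹ g_t; a quick numerical sketch (ours) checking it against the stated minimization:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
B = rng.normal(size=(d, d))
A = B @ B.T + d * np.eye(d)          # a positive definite metric A
x_t = rng.normal(size=d)
g_t = rng.normal(size=d)
eta = 0.3

# Unconstrained minimizer of 0.5*||x - x_t||_A^2 + eta*<g_t, x>:
# the first-order condition A(x - x_t) + eta*g_t = 0 gives a
# preconditioned gradient step.
x_next = x_t - eta * np.linalg.solve(A, g_t)

def objective(x):
    diff = x - x_t
    return 0.5 * diff @ A @ diff + eta * g_t @ x
```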
Regret Bounds

What does adaptation buy?

• Standard regret bound:

    ∑_{t=1}^T f_t(x_t) − f_t(x*) ≤ (1/(2η))‖x_1 − x*‖_2² + (η/2) ∑_{t=1}^T ‖g_t‖_2²

• Regret bound with matrix:

    ∑_{t=1}^T f_t(x_t) − f_t(x*) ≤ (1/(2η))‖x_1 − x*‖²_A + (η/2) ∑_{t=1}^T ‖g_t‖²_{A⁻¹}
Meta Learning Problem

• Have regret:

    ∑_{t=1}^T f_t(x_t) − f_t(x*) ≤ (1/η)‖x_1 − x*‖²_A + (η/2) ∑_{t=1}^T ‖g_t‖²_{A⁻¹}

• What happens if we minimize over A in hindsight?

    min_A ∑_{t=1}^T ⟨g_t, A⁻¹ g_t⟩   subject to A ⪰ 0, tr(A) ≤ C
Meta Learning Problem

• Solution is of the form

    A = c · diag(∑_{t=1}^T g_t g_t^⊤)^{1/2}   (diagonal)
    A = c · (∑_{t=1}^T g_t g_t^⊤)^{1/2}       (full)

  (where c is chosen to satisfy the trace constraint)

• Let g_{1:T,j} be the vector of jth gradient components. Optimal diagonal:

    A_{j,j} ∝ ‖g_{1:T,j}‖_2
Low regret to the best A

• Let g_{1:t,j} be the vector of jth gradient components. At time t, use

    s_t = [‖g_{1:t,j}‖_2]_{j=1}^d   and   A_t = diag(s_t)

    x_{t+1} = argmin_{x∈X} { (1/2)‖x − x_t‖²_{A_t} + η ⟨g_t, x⟩ }

• Example:

    y_t   g_{t,1}  g_{t,2}  g_{t,3}
     1      1        0        0
    -1      .5       0        1
     1     -.5       1        0
    -1      0        0        0
     1      1        1        0
    -1      1        0        0

    s_1 = √3.5,   s_2 = √2,   s_3 = 1
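The diagonal scaling on this slide can be reproduced directly; a small sketch (our own code, not the authors') that recovers s_1 = √3.5, s_2 = √2, s_3 = 1 from the example gradients:

```python
import numpy as np

def adagrad_scales(grads):
    """s[j] = ||g_{1:t,j}||_2: per-coordinate l2 norm of past gradients."""
    return np.sqrt((np.asarray(grads, dtype=float) ** 2).sum(axis=0))

def adagrad_step(x_t, g_t, s_t, eta):
    # Unconstrained minimizer of 0.5*||x - x_t||_{A_t}^2 + eta*<g_t, x>
    # with A_t = diag(s_t): a per-coordinate step of size eta / s_t[j].
    return x_t - eta * g_t / s_t

grads = [[1, 0, 0], [.5, 0, 1], [-.5, 1, 0], [0, 0, 0], [1, 1, 0], [1, 0, 0]]
s = adagrad_scales(grads)
```

Note how the infrequent but predictive coordinates 2 and 3 get larger effective step sizes than the frequent coordinate 1.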
Final Convergence Guarantee

Algorithm: at time t, set

    s_t = [‖g_{1:t,j}‖_2]_{j=1}^d   and   A_t = diag(s_t)

    x_{t+1} = argmin_{x∈X} { (1/2)‖x − x_t‖²_{A_t} + η ⟨g_t, x⟩ }

Define the radius

    R_∞ := max_t ‖x_t − x*‖_∞ ≤ sup_{x∈X} ‖x − x*‖_∞.

Theorem. The final regret bound of AdaGrad:

    ∑_{t=1}^T f_t(x_t) − f_t(x*) ≤ 2 R_∞ ∑_{j=1}^d ‖g_{1:T,j}‖_2.
Understanding the convergence guarantees I

• Stochastic convex optimization:

    f(x) := E_P[f(x; ξ)]

  Sample ξ_t according to P, define f_t(x) := f(x; ξ_t). Then

    E[f((1/T) ∑_{t=1}^T x_t)] − f(x*) ≤ (2R_∞/T) ∑_{j=1}^d E[‖g_{1:T,j}‖_2]
Understanding the convergence guarantees II

Support vector machine example: define

    f(x; ξ) = [1 − ⟨x, ξ⟩]_+,   where ξ ∈ {−1, 0, 1}^d

• If ξ_j ≠ 0 with probability ∝ j^{−α} for α > 1,

    E[f((1/T) ∑_{t=1}^T x_t)] − f(x*) = O((‖x*‖_∞/√T) · max{log d, d^{1−α/2}})

• Previously best-known method:

    E[f((1/T) ∑_{t=1}^T x_t)] − f(x*) = O((‖x*‖_∞/√T) · √d).
Understanding the convergence guarantees III

Back to regret minimization

• Convergence almost as good as with the best geometry matrix:

    ∑_{t=1}^T f_t(x_t) − f_t(x*)
      ≤ 2√d ‖x*‖_∞ √( inf_s { ∑_{t=1}^T ‖g_t‖²_{diag(s)⁻¹} : s ⪰ 0, ⟨1, s⟩ ≤ d } )

• This bound (and the others above) is minimax optimal
The AdaGrad Algorithms

Analysis applies to several algorithms with

    s_t = [‖g_{1:t,j}‖_2]_{j=1}^d,   A_t = diag(s_t)

• Forward-backward splitting (Lions and Mercier 1979, Nesterov 2007,
  Duchi and Singer 2009, others)

    x_{t+1} = argmin_{x∈X} { (1/2)‖x − x_t‖²_{A_t} + ⟨g_t, x⟩ + ϕ(x) }

• Regularized dual averaging (Nesterov 2007, Xiao 2010)

    x_{t+1} = argmin_{x∈X} { (1/t) ∑_{τ=1}^t ⟨g_τ, x⟩ + ϕ(x) + (1/(2t))‖x‖²_{A_t} }
An Example and Experimental Results

• ℓ1-regularization
• Text classification
• Image ranking
• Neural network learning
AdaGrad with composite updates

Recall the more general problem:

    ∑_{t=1}^T [f_t(x_t) + ϕ(x_t)] − inf_{x∈X} ∑_{t=1}^T [f_t(x) + ϕ(x)]

• Must solve updates of the form

    argmin_{x∈X} { ⟨g, x⟩ + ϕ(x) + (1/2)‖x‖²_A }

• Luckily, this is often still simple
AdaGrad with ℓ1 regularization

Set ḡ_t = (1/t) ∑_{τ=1}^t g_τ. Need to solve

    min_x ⟨ḡ_t, x⟩ + λ‖x‖_1 + (1/(2t))⟨x, diag(s_t) x⟩

• Coordinate-wise update yields sparsity and adaptivity:

    x_{t+1,j} = sign(−ḡ_{t,j}) · (t/‖g_{1:t,j}‖_2) · [|ḡ_{t,j}| − λ]_+

[Figure: trajectory of ḡ_t over rounds t, with the λ truncation threshold]
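The coordinate-wise closed form above is a soft-threshold; a minimal sketch (our own code, with illustrative names):

```python
import numpy as np

def adagrad_l1_update(gbar_t, col_norms, t, lam):
    """x_{t+1,j} = sign(-gbar_{t,j}) * (t / ||g_{1:t,j}||_2) * [|gbar_{t,j}| - lam]_+

    gbar_t: running average of gradients; col_norms[j] = ||g_{1:t,j}||_2.
    Coordinates with |gbar_{t,j}| <= lam are truncated exactly to zero.
    """
    return np.sign(-gbar_t) * (t / col_norms) * np.maximum(np.abs(gbar_t) - lam, 0.0)
```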
Text Classification

Reuters RCV1 document classification task—d = 2·10⁶ features,
approximately 4000 non-zero features per document

    f_t(x) := [1 − ⟨x, ξ_t⟩]_+   where ξ_t ∈ {−1, 0, 1}^d is a data sample

                 FOBOS        AdaGrad      PA¹    AROW²
    Economics    .058 (.194)  .044 (.086)  .059   .049
    Corporate    .111 (.226)  .053 (.105)  .107   .061
    Government   .056 (.183)  .040 (.080)  .066   .044
    Medicine     .056 (.146)  .035 (.063)  .053   .039

Test set classification error rate
(sparsity of final predictor in parentheses)

¹Crammer et al., 2006
²Crammer et al., 2009
Image Ranking

ImageNet (Deng et al., 2009), large-scale hierarchical image database

[Figure: hierarchy fragment — Animal → {Vertebrate, Invertebrate},
Vertebrate → {Fish, Mammal}, Mammal → Cow]

Train 15,000 rankers/classifiers to rank images for each noun (as in
Grangier and Bengio, 2008)

Data: ξ = (z¹, z²) ∈ {0,1}^d × {0,1}^d is a pair of images

    f(x; z¹, z²) = [1 − ⟨x, z¹ − z²⟩]_+
Image Ranking Results

Precision at k: proportion of examples in the top k that belong to the
category. Average precision averages over the placements of all positive
examples.

    Algorithm   Avg. Prec.   P@1      P@5      P@10     Nonzero
    AdaGrad     0.6022       0.8502   0.8130   0.7811   0.7267
    AROW        0.5813       0.8597   0.8165   0.7816   1.0000
    PA          0.5581       0.8455   0.7957   0.7576   1.0000
    FOBOS       0.5042       0.7496   0.6950   0.6545   0.8996
Neural Network Learning

Wildly non-convex problem:

    f(x; ξ) = log(1 + exp(⟨[p(⟨x_1, ξ_1⟩) ⋯ p(⟨x_k, ξ_k⟩)], ξ_0⟩))

where

    p(α) = 1/(1 + exp(α))

[Figure: single-layer network — inputs ξ_1, …, ξ_5, weights x_1, …, x_5,
hidden units p(⟨x_i, ξ_i⟩)]

Idea: Use stochastic gradient methods to solve it anyway
Neural Network Learning

[Figure: average frame accuracy (%) on the test set vs. time (hours) for
SGD+GPU, Downpour SGD, Downpour SGD w/ AdaGrad, and Sandblaster L-BFGS
(Dean et al. 2012)]

Distributed, d = 1.7·10⁹ parameters. SGD and AdaGrad use 80 machines
(1000 cores), L-BFGS uses 800 (10000 cores)
Conclusions and Discussion

• Family of algorithms that adapt to the geometry of the data
• Extendable to the full-matrix case to handle feature correlation
• Can derive many efficient algorithms for high-dimensional problems,
  especially with sparsity
• Future: efficient full-matrix adaptivity, other types of adaptation
Thanks!
OGD Sketch: "Almost" Contraction

• Have g_t ∈ ∂f_t(x_t) (ignore ϕ, X for simplicity)
• Before: x_{t+1} = x_t − η g_t

    (1/2)‖x_{t+1} − x*‖_2² ≤ (1/2)‖x_t − x*‖_2² + η(f_t(x*) − f_t(x_t)) + (η²/2)‖g_t‖_2²

• Now: x_{t+1} = x_t − η A⁻¹ g_t

    (1/2)‖x_{t+1} − x*‖²_A = (1/2)‖x_t − x*‖²_A + η⟨g_t, x* − x_t⟩ + (η²/2)‖g_t‖²_{A⁻¹}
                           ≤ (1/2)‖x_t − x*‖²_A + η(f_t(x*) − f_t(x_t)) + (η²/2)‖g_t‖²_{A⁻¹}

  (‖·‖_{A⁻¹} is the dual norm to ‖·‖_A)
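The equality in the display above can be checked numerically; a quick sketch (ours) with a random positive definite A:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
B = rng.normal(size=(d, d))
A = B @ B.T + d * np.eye(d)                 # positive definite A
Ainv = np.linalg.inv(A)
x_t, x_star, g_t = rng.normal(size=(3, d))
eta = 0.25

def sq_norm(v, M):
    return v @ M @ v                        # ||v||_M^2 = <v, M v>

x_next = x_t - eta * Ainv @ g_t             # preconditioned step

lhs = 0.5 * sq_norm(x_next - x_star, A)
rhs = (0.5 * sq_norm(x_t - x_star, A)
       + eta * g_t @ (x_star - x_t)
       + 0.5 * eta ** 2 * sq_norm(g_t, Ainv))
```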
Hindsight minimization

• Focus on the diagonal case (the full-matrix case is similar):

    min_s ∑_{t=1}^T ⟨g_t, diag(s)⁻¹ g_t⟩   subject to s ⪰ 0, ⟨1, s⟩ ≤ C

• Let g_{1:T,j} be the vector of jth gradient components. The solution is of the form

    s_j ∝ ‖g_{1:T,j}‖_2
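The proportionality claim can be sanity-checked numerically; a sketch (ours) comparing the closed-form s against random feasible alternatives:

```python
import numpy as np

rng = np.random.default_rng(2)
T, d, C = 50, 4, 10.0
G = rng.normal(size=(T, d))                 # row t holds g_t

def cost(s):
    # sum_t <g_t, diag(s)^{-1} g_t> = sum_j ||g_{1:T,j}||_2^2 / s_j
    return ((G ** 2).sum(axis=0) / s).sum()

col_norms = np.sqrt((G ** 2).sum(axis=0))   # ||g_{1:T,j}||_2
s_opt = C * col_norms / col_norms.sum()     # s_j proportional to column norms, <1, s> = C
```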
Low regret to the best A

    ∑_{t=1}^T f_t(x_t) + ϕ(x_t) − f_t(x*) − ϕ(x*)
      ≤ (1/(2η)) ∑_{t=1}^T (‖x_t − x*‖²_{A_t} − ‖x_{t+1} − x*‖²_{A_t})   [Term I]
        + (η/2) ∑_{t=1}^T ‖g_t‖²_{A_t⁻¹}                                 [Term II]
Bounding Terms

Define D_∞ = max_t ‖x_t − x*‖_∞ ≤ sup_{x∈X} ‖x − x*‖_∞

• Term I:

    ∑_{t=1}^T (‖x_t − x*‖²_{A_t} − ‖x_{t+1} − x*‖²_{A_t}) ≤ D_∞² ∑_{j=1}^d ‖g_{1:T,j}‖_2

• Term II:

    ∑_{t=1}^T ‖g_t‖²_{A_t⁻¹} ≤ 2 ∑_{t=1}^T ‖g_t‖²_{A_T⁻¹} = 2 ∑_{j=1}^d ‖g_{1:T,j}‖_2

    = 2 √( inf_s { ∑_{t=1}^T ⟨g_t, diag(s)⁻¹ g_t⟩ : s ⪰ 0, ⟨1, s⟩ ≤ ∑_{j=1}^d ‖g_{1:T,j}‖_2 } · ∑_{j=1}^d ‖g_{1:T,j}‖_2 )
Outline:
• Brief Review of Online Convex Optimization
• AdaGrad Algorithm
• Examples and Results
• Conclusions