
Stochastic Optimization for Big Data Analytics

Tianbao Yang‡, Rong Jin†, Shenghuo Zhu‡

Tutorial @ SDM 2014, Philadelphia, Pennsylvania

‡NEC Laboratories America, †Michigan State University

April 26, 2014

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 1 / 99

The updates are available here

http://www.cse.msu.edu/~yangtia1/sdm14-tutorial.pdf

Thanks.

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 2 / 99

Some Claims

No:
- This tutorial is not an exhaustive literature survey
- The algorithms are not necessarily the best for small data
- The theories may not carry over to non-convex optimization

Yes:
- State-of-the-art stochastic optimization for SVM, Logistic Regression, Least Square Regression, LASSO
- A generic distributed library

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 3 / 99

Outline

1 Machine Learning and STochastic OPtimization (STOP): Introduction; Motivation; Warm-up

2 STOP Algorithms for Big Data Classification and Regression: Classification and Regression; Algorithms

3 General Strategies for Stochastic Optimization: Stochastic Gradient Descent and Accelerated Variants; Parallel and Distributed Optimization; Other Effective Strategies

4 Implementations and A Distributed Library

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 4 / 99

Machine Learning and STochastic OPtimization (STOP)

Outline

1 Machine Learning and STochastic OPtimization (STOP): Introduction; Motivation; Warm-up

2 STOP Algorithms for Big Data Classification and Regression: Classification and Regression; Algorithms

3 General Strategies for Stochastic Optimization: Stochastic Gradient Descent and Accelerated Variants; Parallel and Distributed Optimization; Other Effective Strategies

4 Implementations and A Distributed Library

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 5 / 99

Machine Learning and STochastic OPtimization (STOP) Introduction

Introduction

Machine Learning problems and Stochastic Optimization: Classification and Regression in different forms

Motivation to employ STochastic OPtimization (STOP)

Basic Convex Optimization Knowledge

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 6 / 99

Machine Learning and STochastic OPtimization (STOP) Introduction

Three Steps for Machine Learning and Pattern Recognition

Data → Model → Optimization

[Figure: distance to optimal objective vs. iterations for convergence rates 0.5^T, 1/T^2, and 1/T]

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 7 / 99

Machine Learning and STochastic OPtimization (STOP) Introduction

Learning as Optimization

Least Square Regression Problem:

min_{w∈R^d}  (1/n) ∑_{i=1}^n (y_i − w^⊤ x_i)^2 + (λ/2) ‖w‖_2^2

x_i ∈ R^d: d-dimensional feature vector
y_i ∈ R: target variable
w ∈ R^d: model parameters
n: number of data points

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 8 / 99

Machine Learning and STochastic OPtimization (STOP) Introduction

Learning as Optimization

Least Square Regression Problem:

min_{w∈R^d}  (1/n) ∑_{i=1}^n (y_i − w^⊤ x_i)^2 + (λ/2) ‖w‖_2^2,   where the first term is the Empirical Loss

x_i ∈ R^d: d-dimensional feature vector
y_i ∈ R: target variable
w ∈ R^d: model parameters
n: number of data points

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 9 / 99

Machine Learning and STochastic OPtimization (STOP) Introduction

Learning as Optimization

Least Square Regression Problem:

min_{w∈R^d}  (1/n) ∑_{i=1}^n (y_i − w^⊤ x_i)^2 + (λ/2) ‖w‖_2^2,   where the second term (λ/2)‖w‖_2^2 is the Regularization

x_i ∈ R^d: d-dimensional feature vector
y_i ∈ R: target variable
w ∈ R^d: model parameters
n: number of data points

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 10 / 99

Machine Learning and STochastic OPtimization (STOP) Introduction

Learning as Optimization

Classification Problems:

min_{w∈R^d}  (1/n) ∑_{i=1}^n ℓ(y_i w^⊤ x_i) + (λ/2) ‖w‖_2^2

y_i ∈ {+1, −1}: label
Loss function ℓ(z), with z = y w^⊤ x:

1. SVMs: (squared) hinge loss ℓ(z) = max(0, 1 − z)^p, where p = 1, 2

2. Logistic Regression: ℓ(z) = log(1 + exp(−z))

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 11 / 99

Machine Learning and STochastic OPtimization (STOP) Introduction

Learning as Optimization

Feature Selection:

min_{w∈R^d}  (1/n) ∑_{i=1}^n ℓ(w^⊤ x_i, y_i) + λ ‖w‖_1

ℓ1 regularization: ‖w‖_1 = ∑_{i=1}^d |w_i|
λ controls the sparsity level

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 12 / 99

Machine Learning and STochastic OPtimization (STOP) Introduction

Learning as Optimization

Feature Selection using Elastic Net:

min_{w∈R^d}  (1/n) ∑_{i=1}^n ℓ(w^⊤ x_i, y_i) + λ (‖w‖_1 + γ ‖w‖_2^2)

Elastic net regularizer; more robust than the ℓ1 regularizer

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 13 / 99

Machine Learning and STochastic OPtimization (STOP) Introduction

Learning as Optimization

Multi-class/Multi-task Learning:

min_W  (1/n) ∑_{i=1}^n ℓ(W x_i, y_i) + λ R(W)

W ∈ R^{K×d}

R(W) = ‖W‖_F^2 = ∑_{k=1}^K ∑_{j=1}^d W_{kj}^2: Frobenius norm
R(W) = ‖W‖_* = ∑_i σ_i: nuclear norm (sum of singular values)
R(W) = ‖W‖_{1,∞} = ∑_{j=1}^d ‖W_{:j}‖_∞: ℓ_{1,∞} mixed norm

Extensions to matrix cases are possible

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 14 / 99

Machine Learning and STochastic OPtimization (STOP) Motivation

Big Data Challenge

Huge amount of data generated every day:
- Facebook users upload 3 million photos
- Google receives 3 billion queries
- YouTube users upload over 1,700 hours of video
- Global internet population is 2.1 billion people
- 247 billion emails sent

Data Analytics
http://www.visualnews.com/2012/06/19/how-much-data-created-every-minute/

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 15 / 99

Machine Learning and STochastic OPtimization (STOP) Motivation

Why Learning from Big Data is Hard?

min_{w∈R^d}  (1/n) ∑_{i=1}^n ℓ(w^⊤ x_i, y_i) + λ R(w)   (empirical loss + regularizer)

Too many data points
Issue: can't afford to go through the data set many times
Solution: Stochastic Optimization

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 16 / 99

Machine Learning and STochastic OPtimization (STOP) Motivation

Why Learning from Big Data is Hard?

min_{w∈R^d}  (1/n) ∑_{i=1}^n ℓ(w^⊤ x_i, y_i) + λ R(w)   (empirical loss + regularizer)

High-dimensional data
Issue: can't afford second-order optimization (Newton's method)
Solution: first-order methods (i.e., gradient-based methods)

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 17 / 99

Machine Learning and STochastic OPtimization (STOP) Motivation

Why Learning from Big Data is Hard?

min_{w∈R^d}  (1/n) ∑_{i=1}^n ℓ(w^⊤ x_i, y_i) + λ R(w)   (empirical loss + regularizer)

Data are distributed over many machines
Issue: expensive (if not impossible) to move data
Solution: Distributed Optimization

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 18 / 99

Machine Learning and STochastic OPtimization (STOP) Motivation

Stochastic Optimization

Stochastic Optimization:

min_{w∈W}  F(w) = E_ξ[f(w; ξ)]

f(w; ξ) is convex, so F(w) is convex; ξ is a random variable

Methods:
1. Sample Average Approximation: draw ξ_1, ..., ξ_n and solve min_{w∈W} (1/n) ∑_{i=1}^n f(w; ξ_i)
2. Stochastic Approximation: use the stochastic gradient ∇f(w; ξ)

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 19 / 99


Machine Learning and STochastic OPtimization (STOP) Motivation

Machine Learning is Stochastic Optimization

Goal: min_{w∈W}  E_{ξ=(x,y)}[Loss(w^⊤ x, y)]

Empirical Regularized Loss Minimization:

min_{w∈W}  (1/n) ∑_{i=1}^n [Loss(w^⊤ x_i, y_i) + λ R(w)],   where the average (1/n) ∑_{i=1}^n plays the role of E_{ξ=i}

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 20 / 99


Machine Learning and STochastic OPtimization (STOP) Motivation

The Simplest Method for Stochastic Optimization

Stochastic Optimization:
min_{w∈R^d}  F(w) = E_ξ[f(w; ξ)]

Stochastic Gradient Descent (Nemirovski & Yudin, 1978):
w_t = w_{t−1} − γ_t ∇f(w_{t−1}; ξ_t),   with step size γ_t and stochastic gradient satisfying E_{ξ_t}[∇f(w; ξ_t)] = ∇F(w)

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 21 / 99


Machine Learning and STochastic OPtimization (STOP) Motivation

Stochastic Gradient in Machine Learning

F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤ x_i, y_i) + (λ/2) ‖w‖_2^2

Let i_t ∈ {1, ..., n} be uniformly randomly sampled.

Key equation: E_{i_t}[∇ℓ(w^⊤ x_{i_t}, y_{i_t}) + λ w] = ∇F(w)

Computation is very cheap: O(d), compared with O(nd) for the full gradient

w_t = (1 − γ_t λ) w_{t−1} − γ_t ∇ℓ(w_{t−1}^⊤ x_{i_t}, y_{i_t})

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 22 / 99
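As an illustration of the update above, here is a minimal NumPy sketch of the stochastic gradient step for ℓ2-regularized logistic regression. The function name, the logistic-loss choice, and the 1/√t step size are our own illustrative assumptions, not prescriptions from the tutorial.

```python
import numpy as np

def sgd_logistic_l2(X, y, lam, T, c=1.0, seed=0):
    """Minimal SGD sketch for (1/n) sum log(1+exp(-y_i w.x_i)) + (lam/2)||w||^2.
    One uniformly sampled example per iteration; O(d) work per step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)                                 # sample i_t uniformly
        gamma = c / np.sqrt(t)                              # step size gamma_t = c/sqrt(t)
        margin = y[i] * X[i].dot(w)
        grad_loss = -y[i] * X[i] / (1.0 + np.exp(margin))   # gradient of the logistic loss
        w = (1.0 - gamma * lam) * w - gamma * grad_loss     # w_t = (1 - gamma*lam) w_{t-1} - gamma*grad
    return w
```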

Machine Learning and STochastic OPtimization (STOP) Warm-up

Warm-up: Vector, Norm, Inner product, Dual Norm

Bold letters x ∈ R^d (data vector) and w ∈ R^d (model parameter) denote d-dimensional vectors; y_i denotes the response variable of the i-th data point. x, y ∈ X are finite-dimensional variables, with X a normed space.

Norm ‖w‖: R^d → R_+, e.g.
1. ℓ2 norm: ‖w‖_2 = sqrt(∑_{i=1}^d w_i^2)
2. ℓ1 norm: ‖w‖_1 = ∑_{i=1}^d |w_i|
3. ℓ∞ norm: ‖w‖_∞ = max_i |w_i|

Inner product: ⟨x, w⟩ = x^⊤ w = ∑_{i=1}^d x_i w_i

Dual norm: ‖w‖_* = max_{‖x‖≤1} x^⊤ w
1. ‖x‖_2 ⟺ ‖w‖_2
2. ‖x‖_1 ⟺ ‖w‖_∞

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 23 / 99
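A small NumPy check of the norm/dual-norm pairings listed above (purely illustrative; the example vector is arbitrary).

```python
import numpy as np

w = np.array([3.0, -4.0, 1.0])
# l2, l1 and l_inf norms of w
print(np.linalg.norm(w, 2), np.linalg.norm(w, 1), np.linalg.norm(w, np.inf))
# the dual of the l1 norm is the l_inf norm:
# max_{||x||_1 <= 1} x.w is attained at a signed coordinate vector and equals ||w||_inf
print(max(abs(w)))
```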

Machine Learning and STochastic OPtimization (STOP) Warm-up

Warm-up: Convex Optimization

minx∈X f (x)

X is a convex domain

[Figure: the smooth convex function x^2 with a gradient (tangent) line]

f (x) is a convex function

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 24 / 99

Machine Learning and STochastic OPtimization (STOP) Warm-up

Warm-up: Convex Function

Characterization of Convex Function

[Figure: chord above the graph] f(αx + (1−α)y) ≤ α f(x) + (1−α) f(y),  ∀ x, y ∈ X, α ∈ [0, 1]

[Figure: tangent below the graph] f(x) ≥ f(y) + ∇f(y)^⊤ (x − y),  ∀ x, y ∈ X

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 25 / 99

Machine Learning and STochastic OPtimization (STOP) Warm-up

Warm-up: Convergence Measure

Most optimization algorithms are iterative

x_{t+1} = x_t + Δx_t

Iteration Complexity: the number of iterations T(ε) needed to have
f(x_T) − min_{x∈X} f(x) ≤ ε   (ε ≪ 1)

Convergence Rate: after T iterations, how good is the solution
f(x_T) − min_{x∈X} f(x) ≤ ε(T)

[Figure: objective vs. iterations, illustrating T and ε]

Total Runtime = Per-iteration Cost × Iteration Complexity

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 26 / 99


Machine Learning and STochastic OPtimization (STOP) Warm-up

More on Convergence Measure

Big O(·) notation: explicit dependence on T or ε

Convergence Rate | Iteration Complexity
linear O(μ^T) (μ < 1) | O(log(1/ε))
sub-linear O(1/T^α) (α > 0) | O(1/ε^{1/α})

Why are we interested in Bounds?

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 27 / 99


Machine Learning and STochastic OPtimization (STOP) Warm-up

More on Convergence Measure

Convergence Rate | Iteration Complexity
linear O(μ^T) (μ < 1) | O(log(1/ε))
sub-linear O(1/T^α) (α > 0) | O(1/ε^{1/α})

[Figure: distance to optimum vs. iterations T; the 0.5^T curve reaches ε in seconds]

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 28 / 99

Machine Learning and STochastic OPtimization (STOP) Warm-up

More on Convergence Measure

Convergence Rate | Iteration Complexity
linear O(μ^T) (μ < 1) | O(log(1/ε))
sub-linear O(1/T^α) (α > 0) | O(1/ε^{1/α})

[Figure: distance to optimum vs. iterations T; 0.5^T reaches ε in seconds, 1/T in minutes]

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 29 / 99

Machine Learning and STochastic OPtimization (STOP) Warm-up

More on Convergence Measure

Convergence Rate | Iteration Complexity
linear O(μ^T) (μ < 1) | O(log(1/ε))
sub-linear O(1/T^α) (α > 0) | O(1/ε^{1/α})

[Figure: distance to optimum vs. iterations T; 0.5^T reaches ε in seconds, 1/T in minutes, 1/T^0.5 in hours]

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 30 / 99

Machine Learning and STochastic OPtimization (STOP) Warm-up

More on Convergence Measure

Convergence Rate | Iteration Complexity
linear O(μ^T) (μ < 1) | O(log(1/ε))
sub-linear O(1/T^α) (α > 0) | O(1/ε^{1/α})

[Figure: distance to optimum vs. iterations T for 0.5^T (seconds), 1/T (minutes), 1/T^0.5 (hours)]

Theoretically, we consider

O(μ^T) ≺ O(1/T^2) ≺ O(1/T) ≺ O(1/√T)

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 31 / 99

Machine Learning and STochastic OPtimization (STOP) Warm-up

Factors that affect Iteration Complexity

Property of function: e.g., smoothness of the function

Domain X: size and geometry

Size of problem: dimension and number of data points

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 32 / 99


Machine Learning and STochastic OPtimization (STOP) Warm-up

Warm-up: Non-smooth function

Lipschitz continuous: e.g. absolute loss f(x) = |x|
|f(x) − f(y)| ≤ G ‖x − y‖_2,   where G is the Lipschitz constant

Subgradient: f(x) ≥ f(y) + ∂f(y)^⊤ (x − y)

[Figure: the non-smooth function |x| with a sub-gradient line at the kink]

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 33 / 99


Machine Learning and STochastic OPtimization (STOP) Warm-up

Warm-up: Smooth Convex function

Smooth: e.g. logistic loss f(x) = log(1 + exp(−x))
‖∇f(x) − ∇f(y)‖_2 ≤ β ‖x − y‖_2,   where β > 0 is the smoothness constant

The second-order derivative is upper bounded: ‖∇^2 f(x)‖_2 ≤ β

[Figure: log(1 + exp(−x)), its tangent f(y) + f′(y)(x − y) at y, and a quadratic upper bound]

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 34 / 99


Machine Learning and STochastic OPtimization (STOP) Warm-up

Warm-up: Strongly Convex function

Strongly convex: e.g. squared Euclidean norm f(x) = (1/2)‖x‖_2^2
‖∇f(x) − ∇f(y)‖_2 ≥ λ ‖x − y‖_2,   where λ > 0 is the strong convexity constant

The second-order derivative is lower bounded: ‖∇^2 f(x)‖_2 ≥ λ

[Figure: the function x^2 with a gradient (tangent) line]

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 35 / 99


Machine Learning and STochastic OPtimization (STOP) Warm-up

Warm-up: Smooth and Strongly Convex function

Smooth and strongly convex: e.g. the quadratic function f(z) = (1/2)(z − 1)^2

λ ‖x − y‖_2 ≤ ‖∇f(x) − ∇f(y)‖_2 ≤ β ‖x − y‖_2,   with β ≥ λ > 0

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 36 / 99

STOP Algorithms for Big Data Classification and Regression

Outline

1 Machine Learning and STochastic OPtimization (STOP): Introduction; Motivation; Warm-up

2 STOP Algorithms for Big Data Classification and Regression: Classification and Regression; Algorithms

3 General Strategies for Stochastic Optimization: Stochastic Gradient Descent and Accelerated Variants; Parallel and Distributed Optimization; Other Effective Strategies

4 Implementations and A Distributed Library

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 37 / 99

STOP Algorithms for Big Data Classification and Regression

STOP Algorithms for Big Data Classification and Regression

Stochastic Gradient Descent (Pegasos) for SVM

Stochastic Average Gradient (SAG) for Logistic Regression and Regression

Stochastic Dual Coordinate Ascent (SDCA)

Stochastic Optimization for Lasso

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 38 / 99

STOP Algorithms for Big Data Classification and Regression Classification and Regression

Classification and Regression

min_{w∈R^d}  F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤ x_i, y_i) + (λ/2) ‖w‖_2^2

Classification (red indicates a smooth loss):
1. SVM (hinge loss): ℓ(w^⊤ x, y) = max(0, 1 − y w^⊤ x)
2. Smooth SVM (squared hinge loss): ℓ(w^⊤ x, y) = max(0, 1 − y w^⊤ x)^2
3. Equivalent to C-SVM formulations with C = 1/(nλ)
4. Logistic Regression (logistic loss): ℓ(w^⊤ x, y) = log(1 + exp(−y w^⊤ x))

Regression:
1. Least Square Regression (square loss): ℓ(w^⊤ x, y) = (w^⊤ x − y)^2
2. Least Absolute Deviation (absolute loss): ℓ(w^⊤ x, y) = |w^⊤ x − y|

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 39 / 99
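For concreteness, the losses listed above can be written as small NumPy functions; this sketch (names and vectorized form are our own) is only meant to fix notation.

```python
import numpy as np

# Losses from the slide above, as functions of the margin z = y * (w.x)
# or of the prediction p = w.x and target y.
def hinge(z):          return np.maximum(0.0, 1.0 - z)        # SVM
def squared_hinge(z):  return np.maximum(0.0, 1.0 - z) ** 2   # smooth SVM
def logistic(z):       return np.log1p(np.exp(-z))            # logistic regression
def square(p, y):      return (p - y) ** 2                    # least square regression
def absolute(p, y):    return np.abs(p - y)                   # least absolute deviation
```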


STOP Algorithms for Big Data Classification and Regression Algorithms

Timeline of Stochastic Optimization in Machine Learning

min_{w∈R^d}  F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤ x_i, y_i) + (λ/2) ‖w‖_2^2

Timeline:
- 1990s–2003: Basic SGD, general convex objectives, O(1/√T) (Zinkevich03; Kivinen04)
- 2007: Pegasos, strongly convex objectives, O(1/T) (Hazan07; Shalev-Shwartz07)
- 2012–2013: SAG, SDCA, ..., smooth & strongly convex objectives, O(μ^T) (Roux12; Shalev-Shwartz13; Zhang13)

w_t = (1 − γ_t λ) w_{t−1} − γ_t ∇ℓ(w_{t−1}^⊤ x_{i_t}, y_{i_t})

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 40 / 99

STOP Algorithms for Big Data Classification and Regression Algorithms

Basic SGD

Leveraging only convexity

min_{w∈R^d}  F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤ x_i, y_i) + (λ/2) ‖w‖_2^2

Update: w_t = (1 − γ_t λ) w_{t−1} − γ_t ∇ℓ(w_{t−1}^⊤ x_{i_t}, y_{i_t}),   γ_t = c/√t

Output solution: w̄_T = (1/T) ∑_{t=1}^T w_t  ⇒  O(1/√T)

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 41 / 99


STOP Algorithms for Big Data Classification and Regression Algorithms

Pegasos (Shalev-Shwartz et al. (2007))

Leveraging strongly convex regularizer

min_{w∈R^d}  F(w) = (1/n) ∑_{i=1}^n max(0, 1 − y_i w^⊤ x_i) + (λ/2) ‖w‖_2^2   (strongly convex)

Update: w_t = (1 − γ_t λ) w_{t−1} − γ_t ∇ℓ(w_{t−1}^⊤ x_{i_t}, y_{i_t}),   γ_t = 1/(λt)

Output solution: w̄_T = (1/T) ∑_{t=1}^T w_t  ⇒  O(1/(λT))

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 42 / 99


STOP Algorithms for Big Data Classification and Regression Algorithms

Pegasos (Shalev-Shwartz et al. (2007))

e.g. hinge loss (SVM), absolute loss (Least Absolute Deviation)

Stochastic (sub)gradient: ∂ℓ(w^⊤ x_{i_t}, y_{i_t}) = −y_{i_t} x_{i_t} if 1 − y_{i_t} w^⊤ x_{i_t} > 0, and 0 otherwise

Computation cost per iteration: O(d)

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 43 / 99
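A minimal NumPy sketch of the Pegasos-style update described above (hinge loss, step size γ_t = 1/(λt), averaged output); the function name and interface are assumptions for illustration.

```python
import numpy as np

def pegasos(X, y, lam, T, seed=0):
    """Pegasos-style SGD sketch for the SVM objective above: hinge-loss subgradient,
    gamma_t = 1/(lam*t), and averaged output w_bar = (1/T) sum_t w_t."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_bar = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)
        gamma = 1.0 / (lam * t)
        # subgradient of the hinge loss at the sampled example
        g = -y[i] * X[i] if y[i] * X[i].dot(w) < 1.0 else np.zeros(d)
        w = (1.0 - gamma * lam) * w - gamma * g
        w_bar += w / T
    return w_bar
```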

STOP Algorithms for Big Data Classification and Regression Algorithms

SAG (Roux et al. (2012))

Leveraging smoothness of loss

min_{w∈R^d}  F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤ x_i, y_i) + (λ/2) ‖w‖_2^2   (smooth and strongly convex)

Estimated Average Gradient:
G_t = (1/n) ∑_{i=1}^n g_i^t,   where g_i^t = ∂ℓ(w_t^⊤ x_i, y_i) if i = i_t is selected, and g_i^t = g_i^{t−1} otherwise

Update: w_t = (1 − γ_t λ) w_{t−1} − γ_t G_t,   γ_t = c/β

Output solution: w_T  ⇒  O(μ^T)

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 44 / 99


STOP Algorithms for Big Data Classification and Regression Algorithms

SAG: efficient update of averaged gradient

Logistic regression, least square regression, smooth SVM: the individual gradient has the form
g_i = ∂ℓ(w^⊤ x_i, y_i) = α_i x_i

Update of the averaged gradient:
G_t = (1/n) ∑_{i=1}^n g_i^t = (1/n) ∑_{i=1}^n α_i^t x_i = G_{t−1} + (1/n)(α_{i_t}^t − α_{i_t}^{t−1}) x_{i_t}

Computation cost per iteration: O(d)

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 45 / 99
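The following is a minimal NumPy sketch of the SAG update above for the ℓ2-regularized square loss; the constant step size is passed in by the caller, and the zero initialization of the stored gradients is an assumption made for brevity.

```python
import numpy as np

def sag_least_squares(X, y, lam, T, gamma, seed=0):
    """SAG sketch for (1/n) sum (w.x_i - y_i)^2 + (lam/2)||w||^2.
    Stores one scalar alpha_i per example, since grad_i = alpha_i * x_i."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    alpha = np.zeros(n)          # alpha_i such that g_i = alpha_i * x_i
    G = np.zeros(d)              # running average gradient (1/n) sum_i alpha_i x_i
    for t in range(T):
        i = rng.integers(n)
        alpha_new = 2.0 * (X[i].dot(w) - y[i])     # derivative of (p - y)^2 at p = w.x_i
        G += (alpha_new - alpha[i]) * X[i] / n     # O(d) update of the average gradient
        alpha[i] = alpha_new
        w = (1.0 - gamma * lam) * w - gamma * G
    return w
```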

STOP Algorithms for Big Data Classification and Regression Algorithms

SDCA (Shalev-Shwartz & Zhang (2013))

Stochastic Dual Coordinate Ascent (as in liblinear (Hsieh et al., 2008)); non-smooth loss O(1/ε) and smooth loss O(log(1/ε))

Dual Problem:
max_{α∈Q}  D(α) = (1/n) ∑_{i=1}^n −φ_i^*(−α_i) − (λ/2) ‖ (1/(λn)) ∑_{i=1}^n α_i x_i ‖_2^2

Primal solution: w_t = (1/(λn)) ∑_{i=1}^n α_i^t x_i

Dual Coordinate Updates:
Δα_i = arg max_{α_i^t + Δα_i ∈ Q}  −φ_i^*(−α_i^t − Δα_i) − (λn/2) ‖ w_t + (1/(λn)) Δα_i x_i ‖_2^2

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 46 / 99


STOP Algorithms for Big Data Classification and Regression Algorithms

SDCA updates

Closed-form solution for hinge loss, squared hinge loss, absolute loss and square loss (Shalev-Shwartz & Zhang (2013)); e.g. square loss:
Δα_i^t = (y_i − w_t^⊤ x_i − α_i^{t−1}) / (1 + ‖x_i‖_2^2/(λn))

Computation cost per iteration: O(d)

Approximate solution for the logistic loss (Shalev-Shwartz & Zhang (2013))

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 47 / 99
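A minimal sketch of SDCA with the closed-form square-loss coordinate update quoted above, keeping the primal w in sync with the dual α; the function name and initialization are our own illustrative choices.

```python
import numpy as np

def sdca_least_squares(X, y, lam, T, seed=0):
    """SDCA sketch for the square loss, using the closed-form update
    delta = (y_i - w.x_i - alpha_i) / (1 + ||x_i||^2/(lam*n))."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                      # w = (1/(lam*n)) sum_i alpha_i x_i
    for t in range(T):
        i = rng.integers(n)
        delta = (y[i] - X[i].dot(w) - alpha[i]) / (1.0 + X[i].dot(X[i]) / (lam * n))
        alpha[i] += delta
        w += delta * X[i] / (lam * n)    # O(d) primal update
    return w
```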

STOP Algorithms for Big Data Classification and Regression Algorithms

Summary

alg. | Pegasos | SAG | SDCA
strongly convex | yes | yes | yes
smooth | no | yes | yes/no
loss | hinge, abs. | logistic, square, sqh | all left
memory cost | O(d) | O(d + n) | O(d + n)
computation cost* | O(d) | O(d) | O(d)
Iteration Complexity | O(1/(λε)) | O(log(1/ε)) | O(1/(λε)) / O(log(1/ε))
parameter | no | step size | no
averaging | yes | no need | no need

Table: sqh: squared hinge loss; abs.: absolute loss; * per-iteration cost

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 48 / 99


STOP Algorithms for Big Data Classification and Regression Algorithms

What about `1 regularization?

min_{w∈R^d}  (1/n) ∑_{i=1}^n ℓ(w^⊤ x_i, y_i) + σ ∑_{g=1}^K ‖w_g‖_1   (Lasso or Group Lasso)

Issue: the regularizer is Not Strongly Convex

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 49 / 99

STOP Algorithms for Big Data Classification and Regression Algorithms

Adding `2 regularization (Shalev-Shwartz & Zhang, 2012)

min_{w∈R^d}  (1/n) ∑_{i=1}^n ℓ(w^⊤ x_i, y_i) + σ ∑_{g=1}^K ‖w_g‖_1   (Lasso or Group Lasso)

Issue: Not Strongly Convex. Solution: add an ℓ2^2 regularization:

min_{w∈R^d}  (1/n) ∑_{i=1}^n ℓ(w^⊤ x_i, y_i) + σ ∑_{g=1}^K ‖w_g‖_1 + (λ/2) ‖w‖_2^2

Setting λ = Θ(ε), SDCA (non-smooth or smooth) gives O(1/ε^2) for general convex loss and O(1/ε) for smooth loss

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 50 / 99


STOP Algorithms for Big Data Classification and Regression Algorithms

Other algorithms for `1 regularization

(Stochastic) Proximal Gradient Descent
- Proximal Stochastic Gradient Descent (Langford et al., 2009; Shalev-Shwartz & Tewari, 2009; Duchi & Singer, 2009)
- sparsity can be achieved at each iteration (see the sketch below)
- O(1/ε^2) iteration complexity

Stochastic Coordinate Descent (Shalev-Shwartz & Tewari, 2009; Bradley et al., 2011; Richtarik & Takac, 2013)
- needs to compute the full gradient
- O(n/ε) iteration complexity for smooth loss

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 51 / 99
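A minimal sketch of proximal stochastic gradient descent for the ℓ1-regularized problem above, using the soft-thresholding proximal operator of the ℓ1 norm; the square-loss choice, step-size schedule, and function names are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(w, tau):
    """Proximal operator of tau*||w||_1: componentwise soft-thresholding."""
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def prox_sgd_lasso(X, y, sigma, T, c=1.0, seed=0):
    """Proximal SGD sketch for (1/n) sum (w.x_i - y_i)^2 + sigma*||w||_1.
    Each iteration: stochastic gradient step on the loss, then prox of the l1 term;
    the prox step sets small coordinates exactly to zero (sparsity per iteration)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)
        gamma = c / np.sqrt(t)
        grad = 2.0 * (X[i].dot(w) - y[i]) * X[i]
        w = soft_threshold(w - gamma * grad, gamma * sigma)
    return w
```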

General Strategies for Stochastic Optimization

Outline

1 Machine Learning and STochastic OPtimization (STOP): Introduction; Motivation; Warm-up

2 STOP Algorithms for Big Data Classification and Regression: Classification and Regression; Algorithms

3 General Strategies for Stochastic Optimization: Stochastic Gradient Descent and Accelerated Variants; Parallel and Distributed Optimization; Other Effective Strategies

4 Implementations and A Distributed Library

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 52 / 99

General Strategies for Stochastic Optimization

Be Back in 5 minutes

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 53 / 99

General Strategies for Stochastic Optimization

General Strategies for Stochastic Optimization

General strategies for STOP:
SGD and its variants for different objectives

Parallel and Distributed Optimization

Other Effective Strategies

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 54 / 99

General Strategies for Stochastic Optimization Stochastic Gradient Descent and Accelerated variants

Stochastic Gradient Descent

min_{x∈X} f(x),   with stochastic gradient ∇f(x; ξ), where ξ is a random variable

Basic SGD updates:
x_t ← Π_X[x_{t−1} − γ_t ∇f(x_{t−1}; ξ_t)],   where Π_X[x̂] = arg min_{x∈X} ‖x − x̂‖_2^2 is the projection onto X

Issue: how to determine the learning rate (step size) γ_t?

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 55 / 99


General Strategies for Stochastic Optimization Stochastic Gradient Descent and Accelerated variants

Convergence of final solution

Iterative updates

xt = ΠX [xt−1 − γt∆t ]

to have convergence, intuitively γt∆t → 0

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 56 / 99

General Strategies for Stochastic Optimization Stochastic Gradient Descent and Accelerated variants

Convergence of (S)GD

Iterative updates: x_t = Π_X[x_{t−1} − γ_t Δ_t]

GD: x_t = x_{t−1} − γ_t ∇f(x_{t−1});   ∇f(x_{t−1}) → 0, x_t → x_*

SGD: x_t = x_{t−1} − γ_t ∇f(x_{t−1}; ξ_t);   γ_t ∇f(x_{t−1}; ξ_t) → 0

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 57 / 99

General Strategies for Stochastic Optimization Stochastic Gradient Descent and Accelerated variants

Three Schemes of Step size

General Convex Optimization: γ_t ∝ 1/√t → 0

Strongly Convex Optimization: γ_t ∝ 1/t → 0

Smooth Optimization: γ_t = c (constant), with Δ_t → 0

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 58 / 99
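The three step-size schemes above, written as simple Python schedules; the constants c and lam are placeholders that would be tuned in practice.

```python
def step_general_convex(t, c=1.0):
    """gamma_t proportional to 1/sqrt(t) (general convex objectives)."""
    return c / t ** 0.5

def step_strongly_convex(t, lam=0.1):
    """gamma_t = 1/(lam*t) (lam-strongly convex objectives)."""
    return 1.0 / (lam * t)

def step_smooth(t, c=0.1):
    """Constant step size (smooth objectives, with Delta_t -> 0)."""
    return c
```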

General Strategies for Stochastic Optimization Stochastic Gradient Descent and Accelerated variants

SGD for General Convex Function

Step size: γ_t = c/√t, where c usually needs to be tuned

Convergence rate of the final solution x_T (Shamir & Zhang, 2013):
E[f(x_T) − f(x_*)] ≤ O(D G log T / √T),
where ‖x − y‖_2 ≤ D and ‖∂f(x; ξ)‖_2 ≤ G, ∀ x, y ∈ X.

Close to optimal: O(D G / √T)

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 59 / 99


General Strategies for Stochastic Optimization Stochastic Gradient Descent and Accelerated variants

SGD for Strongly Convex Function

f(x) is λ-strongly convex

Step size: γ_t = 1/(λt)

Convergence rate of x_T (Shamir & Zhang, 2013):
E[f(x_T) − f(x_*)] ≤ O(G^2 log T / (λT))

Close to optimal: O(G^2 / (λT))

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 60 / 99

General Strategies for Stochastic Optimization Stochastic Gradient Descent and Accelerated variants

SGD for Smooth Convex Function

A sub-class of general convex functions

SGD with γ_t ∝ 1/√t has O(log T / √T)

Gradient Descent with γ_t = c has O(1/T) (Nesterov, 2004)

Generally SGD can't bridge the gap (Lan, 2012). Special case, e.g.
f(x) = (1/n) ∑_{i=1}^n f_i(x),   ∇f(x_t) = (1/n) ∑_{i=1}^n ∇f_i(x_t),   ∇f(x_t; ξ_t) = ∇f_{i_t}(x_t)

The constant step size of GD is due to ∇f(x_*) = 0

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 61 / 99


General Strategies for Stochastic Optimization Stochastic Gradient Descent and Accelerated variants

Accelerated SGD for smooth function (Johnson & Zhang, 2013;

Mahdavi et al., 2013)

Iterate s = 1, ...:
  Iterate t = 1, ..., m:
    x_t^s = x_{t−1}^s − γ (∇f_{i_t}(x_{t−1}^s) − ∇f_{i_t}(x^{s−1}) + ∇f(x^{s−1})),
    where Δ_t = StoGrad − StoGrad + Grad
  Update x^s: x^s = x_m^s or x^s = (1/m) ∑_{t=1}^m x_t^s,   m = O(n)

Constant step size, with Δ_t → 0: if x^{s−1} → x_*, then ∇f(x^{s−1}) → 0 and ∇f_{i_t}(x_{t−1}^s) − ∇f_{i_t}(x^{s−1}) → 0

Smooth function: O(1/ε)
Smooth & strongly convex function: O(log(1/ε))

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 62 / 99
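A minimal NumPy sketch of the variance-reduced update above (SVRG-style) for ℓ2-regularized least squares; the loss, the constant step size, and the epoch length m = n are illustrative assumptions rather than the tutorial's prescriptions.

```python
import numpy as np

def svrg_least_squares(X, y, lam, gamma, epochs, seed=0):
    """Variance-reduced SGD sketch: each inner step uses
    grad_i(x) - grad_i(x_ref) + full_grad(x_ref), with a constant step size."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    grad_i = lambda w, i: 2.0 * (X[i].dot(w) - y[i]) * X[i] + lam * w
    full_grad = lambda w: 2.0 * X.T.dot(X.dot(w) - y) / n + lam * w
    x_ref = np.zeros(d)
    for s in range(epochs):
        mu = full_grad(x_ref)            # full gradient at the reference point
        x = x_ref.copy()
        for t in range(n):               # m = O(n) inner iterations
            i = rng.integers(n)
            x -= gamma * (grad_i(x, i) - grad_i(x_ref, i) + mu)
        x_ref = x                        # option: last iterate as the new reference
    return x_ref
```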


General Strategies for Stochastic Optimization Stochastic Gradient Descent and Accelerated variants

Averaged Stochastic Gradient Descent

Averaging usually speeds up convergence:

x̄_t = (1 − (1+η)/(t+η)) x̄_{t−1} + ((1+η)/(t+η)) x_t,   η ≥ 0

η = 0 gives simple averaging: x̄_T = (x_1 + ... + x_T)/T

General Convex Optimization (Nemirovski et al., 2009): η = 0 ⇒ O(1/√T) vs O(log T / √T)

Strongly Convex Optimization (Shamir & Zhang, 2013; Zhu, 2013): η > 0 ⇒ O(1/(λT)) vs O(log T / (λT))

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 63 / 99
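The averaging recursion above as a small sketch; η is the weighting parameter, and η = 0 reproduces the plain running mean of the iterates.

```python
def averaged_iterates(iterates, eta=0.0):
    """Running weighted average: x_bar_t = (1 - (1+eta)/(t+eta)) x_bar_{t-1}
    + ((1+eta)/(t+eta)) x_t. With eta = 0 this is the mean of x_1..x_T."""
    x_bar = None
    for t, x in enumerate(iterates, start=1):
        w = (1.0 + eta) / (t + eta)
        x_bar = x if x_bar is None else (1.0 - w) * x_bar + w * x
    return x_bar
```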


General Strategies for Stochastic Optimization Parallel and Distributed Optimization

General Strategies for Stochastic Optimization

General strategies for STOP:
SGD and its variants for different objectives

Parallel and Distributed Optimization

Other Effective Strategies

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 64 / 99

General Strategies for Stochastic Optimization Parallel and Distributed Optimization

Parallel and Distributed Optimization

Parallel (shared memory) vs. Distributed (memory not shared)

Goal: to speed up convergence

Data distributed over multiple machines: moving it to a single machine suffers from low network bandwidth and limited disk or memory

Benefits from: a cluster of machines; a multi-core machine, GPU

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 65 / 99

General Strategies for Stochastic Optimization Parallel and Distributed Optimization

A simple solution: Average Runs

Multi-core machine / cluster of machines

[Figure: the Data is split across workers, each producing a solution w_1, w_2, ..., w_6]

w̄ = (1/k) ∑_{i=1}^k w_i.   Issue: Not the Optimal

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 66 / 99

General Strategies for Stochastic Optimization Parallel and Distributed Optimization

Parallel SGD: Average Gradients

Mini-batch SGD (multi-core or cluster), with synchronization per mini-batch

Good: reduced variance, faster convergence
Bad: synchronization is expensive

Solutions:
- asynchronous updates: HogWild!
- fewer synchronizations: DisDCA

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 67 / 99
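A small sketch of a mini-batch SGD step: the stochastic gradient is averaged over a batch of sampled examples, which reduces its variance; in a parallel setting each worker would compute part of this average before a synchronization. The square loss and fixed step size are illustrative assumptions.

```python
import numpy as np

def minibatch_sgd_least_squares(X, y, lam, T, batch_size, gamma, seed=0):
    """Mini-batch SGD sketch: average the gradient over `batch_size` samples per step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(T):
        idx = rng.integers(n, size=batch_size)
        B, yb = X[idx], y[idx]
        grad = 2.0 * B.T.dot(B.dot(w) - yb) / batch_size + lam * w
        w -= gamma * grad
    return w
```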

General Strategies for Stochastic Optimization Parallel and Distributed Optimization

Lock-free Parallel SGD: HOGWILD! (Niu et al., 2011)

min_x  ∑_{e∈E} f_e(x_e)

Multi-core with shared-memory access; each e is a small subset of [d]

Examples: sparse SVM, matrix completion, graph cuts
Robust 1/T convergence rate for strongly convex objectives

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 68 / 99

General Strategies for Stochastic Optimization Parallel and Distributed Optimization

Distributed SDCA (Yang, 2013)

Δα_i = arg max  −φ_i^*(−α_i^t − Δα_i) − (λn/2) ‖ w_t + (1/(λn)) Δα_i x_i ‖_2^2

Convergence is not guaranteed: data are correlated

Δα_i = arg max  −φ_i^*(−α_i^t − Δα_i) − (λn/(2K)) ‖ w_t + (K/(λn)) Δα_i x_i ‖_2^2

Guaranteed convergence; limited speed-up

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 69 / 99


General Strategies for Stochastic Optimization Parallel and Distributed Optimization

DisDCA: Trading Computation for Communication

Δα_{i_j} = arg max  −φ_{i_j}^*(−α_{i_j}^t − Δα_{i_j}) − (λn/(2K)) ‖ u_j^t + (K/(λn)) Δα_{i_j} x_{i_j} ‖_2^2

u_{j+1}^t = u_j^t + (K/(λn)) Δα_{i_j} x_{i_j}

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 70 / 99

General Strategies for Stochastic Optimization Parallel and Distributed Optimization

DisDCA: Trading Computation for Communication

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 71 / 99

General Strategies for Stochastic Optimization Parallel and Distributed Optimization

DisDCA: Trading Computation for Communication

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 72 / 99

General Strategies for Stochastic Optimization Parallel and Distributed Optimization

DisDCA

Increasing m could lead to nearly linear speed-up; increasing K leads to parallel speed-up

[Figure (a): 1 million synthetic data for regression; log of the gap φ(t, m) vs. iterations t for m = 10, 100, 1000]
[Figure (b): 1 billion synthetic data for classification (400 GB, 50×2 processors); training time in minutes of DisDCA vs. Liblinear at precision levels ε(T) from 0.1 down to 0.0001, for n = 10^7 and n = 10^9]

The Distributed Library: Birds

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 73 / 99

General Strategies for Stochastic Optimization Other Effective Strategies

General Strategies for Stochastic Optimization

General strategies for STOP:
SGD and its variants for different objectives

Parallel and Distributed Optimization

Other Effective Strategies

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 74 / 99

General Strategies for Stochastic Optimization Other Effective Strategies

Factors that affect Iteration Complexity

Property of function: smoothness of function

Size of problem: dimension and number of data points

Domain X : size and geometry

Screening for Lasso and Support Vector Machine

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 75 / 99


General Strategies for Stochastic Optimization Other Effective Strategies

Screening for Lasso

Lasso: min_{w∈R^d}  (1/2) ‖y − Xw‖_2^2 + λ ‖w‖_1

y = (y_1, ..., y_n)^⊤ ∈ R^n,   X = (x_1, ..., x_d) ∈ R^{n×d}

I_0 = {i : w_i^* = 0},   I = [d] \ I_0

Lasso: min_{w_I}  (1/2) ‖y − X_I w_I‖_2^2 + λ ‖w_I‖_1

SAFE rule (Ghaoui et al., 2010), DPP (Wang et al., 2012)

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 76 / 99
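A small sketch of a SAFE-style screening test for the Lasso above. The specific threshold is the basic SAFE rule of El Ghaoui et al. (2010) as we recall it, so both the formula and the function interface should be treated as assumptions to check against the paper.

```python
import numpy as np

def safe_screen_lasso(X, y, lam):
    """SAFE-style screening sketch for min 0.5*||y - Xw||^2 + lam*||w||_1.
    Assumed rule: feature j can be discarded (w_j^* = 0) if
        |x_j.y| < lam - ||x_j|| * ||y|| * (lam_max - lam) / lam_max,
    where lam_max = max_j |x_j.y|."""
    corr = np.abs(X.T.dot(y))                        # |x_j^T y| for each feature
    lam_max = corr.max()
    thresh = lam - np.linalg.norm(X, axis=0) * np.linalg.norm(y) * (lam_max - lam) / lam_max
    return corr >= thresh                            # boolean mask of surviving features
```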


General Strategies for Stochastic Optimization Other Effective Strategies

Screening for Support Vector Machine

Dual SVM: $\displaystyle\max_{\alpha\in[0,1]^n}\ \frac{1}{n}\sum_{i=1}^n \alpha_i - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i y_i x_i\right\|_2^2$

$y_i\,{w^*}^\top x_i < 1 \;\Rightarrow\; \alpha^*_i = 1, \qquad y_i\,{w^*}^\top x_i > 1 \;\Rightarrow\; \alpha^*_i = 0$

Ball Test (Ogawa et al., 2014; Wang et al., 2013)
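A sketch of the ball test is below: it assumes a ball $B(c, r)$ already known to contain $w^*$ (constructing $(c, r)$ from a reference solution is the substance of the cited papers and is not shown), and fixes the dual variables whose margin sign is certain over the whole ball.

```python
import numpy as np

def ball_test_svm(X, y, c, r):
    """Given a ball B(c, r) containing the optimal w*, return masks of examples
    whose optimal dual variable can be fixed to 0 or 1 before training."""
    margins = y * (X @ c)
    slack = r * np.linalg.norm(X, axis=1)   # worst-case margin change over the ball
    alpha_zero = margins - slack > 1.0      # y_i w^T x_i > 1 for every w in the ball
    alpha_one = margins + slack < 1.0       # y_i w^T x_i < 1 for every w in the ball
    return alpha_zero, alpha_one
```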

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 77 / 99

General Strategies for Stochastic Optimization Other Effective Strategies

Factors that affect Iteration Complexity

Property of function: smoothness of function

Size of problem: dimension and number of data points

Domain X : size and geometry

Stochastic Mirror Descent

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 78 / 99

General Strategies for Stochastic Optimization Other Effective Strategies

Reducing G, D

Iteration complexity of SGD depends on $\|x - x^*\|_2 \le D$ and $\|\nabla f(x;\xi)\|_2 \le G$: the smaller these bounds, the fewer iterations are needed.

Interpretation of Gradient Descent

$$x_t = \Pi_X\big[x_{t-1} - \gamma_t \nabla f(x_{t-1};\xi_t)\big] = \arg\min_{x\in X}\ \underbrace{f(x_{t-1}) + (x - x_{t-1})^\top \nabla f(x_{t-1};\xi_t)}_{\text{linear approximation}} + \underbrace{\frac{1}{2\gamma_t}\|x - x_{t-1}\|_2^2}_{\text{distance to the last solution}}$$

$$(x - x_{t-1})^\top \nabla f(x_{t-1};\xi_t) \le \|x - x_{t-1}\|_2\,\|\nabla f(x_{t-1};\xi_t)\|_2 \le GD$$

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 79 / 99

General Strategies for Stochastic Optimization Other Effective Strategies

Stochastic Mirror Descent (Nemirovski et al., 2009)

$$x_t = \arg\min_{x\in X}\ \underbrace{f(x_{t-1}) + (x - x_{t-1})^\top \nabla f(x_{t-1};\xi_t)}_{\text{linear approximation}} + \frac{1}{\gamma_t}\underbrace{B(x, x_{t-1})}_{\text{Bregman divergence}}$$

$B(x, x_t) = \omega(x) - \omega(x_t) - \nabla\omega(x_t)^\top (x - x_t)$

$B(x, x_t) \ge \frac{\alpha}{2}\|x - x_t\|^2$: strongly convex w.r.t. a general norm

$(x - x_{t-1})^\top \nabla f(x_{t-1};\xi_t) \le \|x - x_{t-1}\|\,\|\nabla f(x_{t-1};\xi_t)\|_*$

$$\mathbb{E}[f(x_T)] - f(x^*) \le O\!\left(\frac{DG}{\sqrt{T}}\right), \quad \text{where } B(x, x^*) \le D \text{ and } \|\nabla f(x;\xi)\|_* \le G$$
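As a concrete instance (my choice of example, not one from the slides): with the entropy mirror map $\omega(x) = \sum_i x_i\log x_i$ on the probability simplex, $B(x, x_t)$ is the KL divergence, the mirror step becomes a multiplicative update, and $D, G$ are measured in the $\ell_1/\ell_\infty$ pair, which can be much smaller than the Euclidean pair in high dimension.

```python
import numpy as np

def smd_entropy_simplex(stoch_grad, d, T, gamma):
    """Stochastic mirror descent on the probability simplex with the entropy
    mirror map: exponentiated-gradient updates followed by renormalization."""
    x = np.full(d, 1.0 / d)
    x_sum = np.zeros(d)
    for _ in range(T):
        g = stoch_grad(x)            # unbiased estimate of the gradient at x
        x = x * np.exp(-gamma * g)   # mirror step
        x /= x.sum()                 # Bregman projection back onto the simplex
        x_sum += x
    return x_sum / T                 # averaged iterate
```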

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 80 / 99


General Strategies for Stochastic Optimization Other Effective Strategies

Reducing Projections

Factors that affect Iteration Complexity

Property of function: smoothness of function

Size of problem: dimension and number of data points

Domain X : size and geometry

$$x_t = \Pi_X\big[x_{t-1} - \gamma_t\nabla f(x_{t-1};\xi_t)\big]$$

A complex domain X makes the projection expensive, e.g., the PSD cone

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 81 / 99

General Strategies for Stochastic Optimization Other Effective Strategies

Reducing Projections

Linear Optimization over the Domain: the Frank-Wolfe Algorithm (Jaggi, 2013; Lacoste-Julien et al., 2013; Hazan, 2008)

$s_t = \arg\min_{s\in X}\ \langle s, \nabla f(x_{t-1})\rangle$ : linear optimization, no projection

$x_t = (1 - \eta_t)x_{t-1} + \eta_t s_t$ : a convex combination, so $x_t \in X$
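For example (an illustrative domain choice, not one from the slides), over an $\ell_1$-ball the linear step reduces to selecting a single signed coordinate, so each iteration needs one gradient and one argmax, and no projection at all.

```python
import numpy as np

def frank_wolfe_l1(grad, x0, radius, T):
    """Frank-Wolfe for minimizing a smooth f over the l1-ball of given radius;
    grad(x) returns the gradient at x."""
    x = x0.copy()
    for t in range(1, T + 1):
        g = grad(x)
        i = np.argmax(np.abs(g))
        s = np.zeros_like(x)
        s[i] = -radius * np.sign(g[i])   # minimizes <s, g> over the l1-ball
        eta = 2.0 / (t + 2)              # standard step-size schedule
        x = (1 - eta) * x + eta * s      # convex combination stays feasible
    return x
```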

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 82 / 99

General Strategies for Stochastic Optimization Other Effective Strategies

Reducing Projections

Few Projections: SGD with only one or $\log T$ projections (Mahdavi et al., 2012; Yang & Zhang, 2013)

$x \in X \iff g(x) \le 0$

SGD for the min-max problem

$$\min_{x}\max_{\lambda\ge 0}\ \underbrace{f(x)}_{\text{objective}} + \lambda\,\underbrace{g(x)}_{\text{violation of constraints}}$$

Final projection

$$\hat{x}_T = \Pi_X\left[\frac{1}{T}\sum_{t=1}^T x_t\right]$$
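A schematic sketch of the idea is below, under the simplification that the dual variable is clipped to a fixed interval $[0, \lambda_{\max}]$; the cited papers use a specific bounded formulation and step sizes, so this illustrates the structure rather than their exact algorithm.

```python
import numpy as np

def sgd_one_projection(stoch_grad_f, g, grad_g, x0, proj_X, T, gamma, lam_max=100.0):
    """Primal-descent / dual-ascent steps on f(x) + lam * g(x) with no projection
    inside the loop; only the final averaged iterate is projected onto X."""
    x, lam = x0.copy(), 0.0
    x_sum = np.zeros_like(x0)
    for _ in range(T):
        gx = g(x)                                            # constraint value g(x)
        x = x - gamma * (stoch_grad_f(x) + lam * grad_g(x))  # descent in x
        lam = min(max(lam + gamma * gx, 0.0), lam_max)       # ascent in lambda
        x_sum += x
    return proj_X(x_sum / T)                                 # the only projection
```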

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 83 / 99

General Strategies for Stochastic Optimization Other Effective Strategies

How about kernel methods?

Linearization + STOP for linear methods
the Nystrom method (Drineas & Mahoney, 2005)

Random Fourier Features (Rahimi & Recht, 2007)

Comparison of the two (Yang et al., 2012)
the Nystrom method: data-dependent sampling, better approximation error under a large eigen-gap and a power-law eigenvalue distribution

Random Fourier Features: data-independent sampling
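A minimal sketch of the random Fourier feature map for the Gaussian kernel (the kernel choice and bandwidth parameterization are assumptions for illustration); after the mapping, any of the linear stochastic solvers above applies unchanged.

```python
import numpy as np

def rff_features(X, D, sigma, seed=0):
    """Map X (n x d) to D random Fourier features approximating the RBF kernel
    k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=1.0 / sigma, size=(d, D))   # frequencies ~ N(0, sigma^{-2} I)
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)        # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)
```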

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 84 / 99

Implementations and A Distributed Library

Outline

1 Machine Learning and STochastic OPtimization (STOP)
Introduction
Motivation
Warm-up

2 STOP Algorithms for Big Data Classification and Regression
Classification and Regression
Algorithms

3 General Strategies for Stochastic Optimization
Stochastic Gradient Descent and Accelerated variants
Parallel and Distributed Optimization
Other Effective Strategies

4 Implementations and A Distributed Library

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 85 / 99

Implementations and A Distributed Library

Implementations and A Distributed Library

Efficient implementations and a practical library
Efficient averaging

Gradient sparsification

Distributed (parallel) optimization library

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 86 / 99

Implementations and A Distributed Library

Efficient Averaging

Update rule:

$x_t = (1 - \gamma_t\lambda)\,x_{t-1} - \gamma_t g_t$

$\bar{x}_t = (1 - \alpha_t)\,\bar{x}_{t-1} + \alpha_t x_t$

Efficient update when $g_t$ has many zeros, i.e., $g_t$ is sparse:

$$S_t = \begin{pmatrix} 1-\lambda\gamma_t & 0 \\ \alpha_t(1-\lambda\gamma_t) & 1-\alpha_t \end{pmatrix} S_{t-1}, \qquad S_1 = I$$

$y_t = y_{t-1} - [S_t^{-1}]_{11}\,\gamma_t g_t$

$\bar{y}_t = \bar{y}_{t-1} - \big([S_t^{-1}]_{21} + [S_t^{-1}]_{22}\,\alpha_t\big)\gamma_t g_t$

$x_T = [S_T]_{11}\, y_T$

$\bar{x}_T = [S_T]_{21}\, y_T + [S_T]_{22}\, \bar{y}_T$

When the gradient is sparse, each step touches only the nonzero coordinates of $g_t$.
Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 87 / 99
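A sketch of this bookkeeping with the gradient stored sparsely (as a dict of index → value) is below; the names, the zero initialization, and the explicit 2×2 inverse are illustrative choices, not the tutorial's implementation.

```python
import numpy as np

def sgd_with_lazy_average(grads, d, lam, gammas, alphas):
    """SGD with an exponentially weighted running average, touching only the
    nonzero coordinates of each sparse gradient g_t (assumes x_0 = xbar_0 = 0)."""
    y, ybar = np.zeros(d), np.zeros(d)
    S = np.eye(2)
    for g, gamma, alpha in zip(grads, gammas, alphas):
        A = np.array([[1 - lam * gamma, 0.0],
                      [alpha * (1 - lam * gamma), 1 - alpha]])
        S = A @ S
        Sinv = np.linalg.inv(S)                  # 2x2 inverse, O(1) per step
        for i, gi in g.items():                  # only nonzero coordinates
            y[i] -= Sinv[0, 0] * gamma * gi
            ybar[i] -= (Sinv[1, 0] + Sinv[1, 1] * alpha) * gamma * gi
    x = S[0, 0] * y                              # S is lower triangular
    xbar = S[1, 0] * y + S[1, 1] * ybar
    return x, xbar
```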

Implementations and A Distributed Library

Gradient sparsification

Sparsification by importance sampling

$R_{ti} = \mathrm{unif}(0, 1)$

$$\tilde{g}_{ti} = g_{ti}\,\big[|g_{ti}| \ge \bar{g}_i\big] + \bar{g}_i\,\mathrm{sign}(g_{ti})\,\big[\bar{g}_i R_{ti} \le |g_{ti}| < \bar{g}_i\big]$$

Unbiased sample: $\mathbb{E}[\tilde{g}_t] = g_t$. Trades increased variance for efficient computation.

Especially useful for Logistic Regression
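A vectorized sketch of the rule above, with gbar the per-coordinate truncation level (names are illustrative):

```python
import numpy as np

def sparsify_gradient(g, gbar, rng=None):
    """Unbiased gradient sparsification: keep large entries as they are, keep a
    small entry with probability |g_i| / gbar_i (rescaled to gbar_i), drop the rest."""
    rng = rng or np.random.default_rng()
    r = rng.uniform(size=g.shape)
    keep_large = np.abs(g) >= gbar
    keep_small = (~keep_large) & (gbar * r <= np.abs(g))
    return g * keep_large + gbar * np.sign(g) * keep_small
```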

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 88 / 99

Implementations and A Distributed Library

Distributed Optimization Library: Birds

The Birds library implements distributed stochastic dual coordinate ascent (DisDCA) for classification and regression with broad support. For technical details see:

"Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent." Tianbao Yang. NIPS 2013.
"Analysis of Distributed Stochastic Dual Coordinate Ascent." Tianbao Yang et al. Tech Report 2013, arXiv.

The code is distributed under the GNU General Public License (see license.txt for details).

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 89 / 99

Implementations and A Distributed Library

Distributed Optimization Library: Birds

What problems does it solve?
Classification and Regression
Loss
1 Hinge loss and squared hinge loss (SVM)
2 Logistic loss (Logistic Regression)
3 Least square loss (Ridge Regression)
Regularizer
1 $\ell_2$ norm: SVM, Logistic Regression, Ridge Regression
2 $\ell_1$ norm: Lasso, SVM, LR with $\ell_1$ norm
Multi-class: one-vs-all

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 90 / 99

Implementations and A Distributed Library

Distributed Optimization Library: Birds

What data does it support?
dense, sparse
txt, binary

What environment does it support?
Prerequisites: Boost.MPI and Boost.Serialization Library
Tested on a cluster of Linux machines (up to hundreds of processors)

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 91 / 99

Implementations and A Distributed Library

Thank You!

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 92 / 99

Implementations and A Distributed Library

References I

Bradley, Joseph K., Kyrola, Aapo, Bickson, Danny, and Guestrin, Carlos. Parallel coordinate descent for l1-regularized loss minimization. CoRR, 2011.

Drineas, Petros and Mahoney, Michael W. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. J. Mach. Learn. Res., 6:2153–2175, 2005.

Duchi, John and Singer, Yoram. Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res., 10:2899–2934, 2009.

Ghaoui, Laurent El, Viallon, Vivian, and Rabbani, Tarek. Safe feature elimination in sparse supervised learning. CoRR, abs/1009.3515, 2010.

Hazan, Elad. Sparse approximate solutions to semidefinite programs. In LATIN, pp. 306–316, 2008.

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 93 / 99

Implementations and A Distributed Library

References II

Hsieh, Cho-Jui, Chang, Kai-Wei, Lin, Chih-Jen, Keerthi, S. Sathiya, and Sundararajan, S. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pp. 408–415, 2008.

Jaggi, Martin. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML 2013 - Proceedings of the 30th International Conference on Machine Learning, 2013.

Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp. 315–323, 2013.

Lacoste-Julien, Simon, Jaggi, Martin, Schmidt, Mark W., and Pletscher, Patrick. Block-coordinate Frank-Wolfe optimization for structural SVMs. In ICML (1), volume 28, pp. 53–61, 2013.

Lan, Guanghui. An optimal method for stochastic composite optimization. Math. Program., 133(1-2):365–397, 2012.

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 94 / 99

Implementations and A Distributed Library

References III

Langford, John, Li, Lihong, and Zhang, Tong. Sparse online learning via truncated gradient. J. Mach. Learn. Res., 10:777–801, June 2009.

Mahdavi, Mehrdad, Yang, Tianbao, Jin, Rong, Zhu, Shenghuo, and Yi, Jinfeng. Stochastic gradient descent with only one projection. In NIPS, pp. 503–511, 2012.

Mahdavi, Mehrdad, Zhang, Lijun, and Jin, Rong. Mixed optimization for smooth functions. In NIPS, pp. 674–682, 2013.

Nemirovski, A. and Yudin, D. On Cezari's convergence of the steepest descent method for approximating saddle point of convex-concave functions. Soviet Math. Dokl., 19:341–362, 1978.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM J. on Optimization, pp. 1574–1609, 2009.

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 95 / 99

Implementations and A Distributed Library

References IV

Nesterov, Yurii. Introductory Lectures on Convex Optimization: A Basic Course (Applied Optimization). Springer Netherlands, 2004.

Niu, Feng, Recht, Benjamin, Re, Christopher, and Wright, Stephen J. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. CoRR, 2011.

Ogawa, Kohei, Suzuki, Yoshiki, Suzumura, Shinya, and Takeuchi, Ichiro. Safe sample screening for support vector machines. CoRR, 2014.

Rahimi, Ali and Recht, Benjamin. Random features for large-scale kernel machines. In NIPS, 2007.

Richtarik, Peter and Takac, Martin. Distributed coordinate descent method for learning with big data. CoRR, abs/1310.2059, 2013.

Roux, Nicolas Le, Schmidt, Mark, and Bach, Francis. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. CoRR, 2012.

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 96 / 99

Implementations and A Distributed Library

References V

Shalev-Shwartz, Shai and Tewari, Ambuj. Stochastic methods for l1 regularized loss minimization. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pp. 929–936, 2009.

Shalev-Shwartz, Shai and Zhang, Tong. Proximal stochastic dual coordinate ascent. CoRR, abs/1211.2717, 2012.

Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14:567–599, 2013.

Shalev-Shwartz, Shai, Singer, Yoram, and Srebro, Nathan. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, pp. 807–814, 2007.

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 97 / 99

Implementations and A Distributed Library

References VI

Shalev-Shwartz, Shai, Singer, Yoram, Srebro, Nathan, and Cotter, Andrew. Pegasos: primal estimated sub-gradient solver for SVM. Math. Program., 127(1):3–30, 2011.

Shamir, Ohad and Zhang, Tong. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In ICML (1), pp. 71–79, 2013.

Wang, Jie, Lin, Binbin, Gong, Pinghua, Wonka, Peter, and Ye, Jieping. Lasso screening rules via dual polytope projection. CoRR, abs/1211.3966, 2012.

Wang, Jie, Wonka, Peter, and Ye, Jieping. Scaling SVM and least absolute deviations via exact data reduction. CoRR, abs/1310.7048, 2013. URL http://dblp.uni-trier.de/db/journals/corr/corr1310.html#WangWY13.

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 98 / 99

Implementations and A Distributed Library

References VII

Yang, Tianbao. Trading computation for communication: Distributed stochastic dual coordinate ascent. In NIPS, 2013.

Yang, Tianbao and Zhang, Lijun. Efficient stochastic gradient descent for strongly convex optimization. CoRR, abs/1304.5504, 2013.

Yang, Tianbao, Li, Yu-Feng, Mahdavi, Mehrdad, Jin, Rong, and Zhou, Zhi-Hua. Nyström method vs random Fourier features: A theoretical and empirical comparison. In NIPS, pp. 485–493, 2012.

Zhu, Shenghuo. Stochastic gradient descent algorithms for strongly convex functions at O(1/T) convergence rates. CoRR, abs/1305.2218, 2013.

Yang, Jin, Zhu (NEC Labs America, MSU) Tutorial for SDM’14 April 26, 2014 99 / 99