Virginia Smith, Researcher, UC Berkeley at MLconf SF 2016


Transcript of Virginia Smith, Researcher, UC Berkeley at MLconf SF 2016

Page 1:

CoCoA: A General Framework for Communication-Efficient Distributed Optimization

Virginia Smith ⋅ Simone Forte ⋅ Chenxin Ma ⋅ Martin Takac ⋅ Martin Jaggi ⋅ Michael I. Jordan

Pages 2-4: Machine Learning with Large Datasets

image/music/video tagging, document categorization, item recommendation, click-through rate prediction, sequence tagging, protein structure prediction, sensor data prediction, spam classification, fraud detection

Pages 5-11: Machine Learning Workflow

DATA & PROBLEM: classification, regression, collaborative filtering, …
MACHINE LEARNING MODEL: logistic regression, lasso, support vector machines, …
OPTIMIZATION ALGORITHM: gradient descent, coordinate descent, Newton's method, …
SYSTEMS SETTING: multi-core, cluster, cloud, supercomputer, …

Open Problem: efficiently solving the objective when data is distributed

Pages 12-23: Distributed Optimization

"Always communicate": reduce $w \leftarrow w - \alpha \sum_k \Delta w_k$
✔ convergence guarantees
✗ high communication

"Never communicate": average $w := \frac{1}{K} \sum_k w_k$
✔ low communication
✗ convergence not guaranteed
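To make the two extremes concrete, here is a minimal sketch (my own illustration, not from the talk) of the two communication patterns for distributed gradient descent. The shard layout, `grad_fn`, `alpha`, `rounds`, and `d` are all assumptions made for the example.

    import numpy as np

    def always_communicate(shards, grad_fn, alpha=0.1, rounds=100, d=10):
        # One reduce per iteration: w <- w - alpha * sum_k dw_k  (high communication)
        w = np.zeros(d)
        for _ in range(rounds):
            dws = [grad_fn(w, Xk, yk) for Xk, yk in shards]   # one gradient per machine k
            w = w - alpha * np.sum(dws, axis=0)               # communicate every iteration
        return w

    def never_communicate(shards, grad_fn, alpha=0.1, rounds=100, d=10):
        # Workers run independently; average once at the end: w := (1/K) * sum_k w_k
        local_solutions = []
        for Xk, yk in shards:
            wk = np.zeros(d)
            for _ in range(rounds):
                wk = wk - alpha * grad_fn(wk, Xk, yk)         # purely local steps
            local_solutions.append(wk)
        return np.mean(local_solutions, axis=0)               # single communication round

For instance, with a least-squares gradient grad_fn = lambda w, X, y: X.T @ (X @ w - y) / len(y), the first variant communicates a d-dimensional vector every iteration, while the second communicates exactly once.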

Pages 24-31: Mini-batch

reduce: $w \leftarrow w - \frac{\alpha}{|b|} \sum_{i \in b} \Delta w_i$
✔ convergence guarantees
✔ tunable communication

A natural middle-ground.
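A short sketch of the mini-batch update (again my own illustration, with an assumed per-example gradient `point_grad`): the batch size |b| is the knob that trades local computation against communication.

    import numpy as np

    def minibatch_sgd(X, y, point_grad, alpha=0.1, batch_size=256, rounds=100, seed=0):
        # Mini-batch reduce: w <- w - (alpha/|b|) * sum_{i in b} dw_i
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(rounds):
            b = rng.choice(n, size=min(batch_size, n), replace=False)
            dw = sum(point_grad(w, X[i], y[i]) for i in b)   # in a cluster, each worker sums its share
            w = w - (alpha / len(b)) * dw                    # one reduce per round
        return w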

Pages 32-35: Mini-batch Limitations

1. ONE-OFF METHODS
2. STALE UPDATES
3. AVERAGE OVER BATCH SIZE

Pages 36-37: LARGE-SCALE OPTIMIZATION

CoCoA
ProxCoCoA+

Pages 38-40: Mini-batch Limitations → CoCoA-v1

1. ONE-OFF METHODS → Primal-Dual Framework
2. STALE UPDATES → Immediately apply local updates
3. AVERAGE OVER BATCH SIZE → Average over K ≪ batch size

Pages 41-47: 1. Primal-Dual Framework

PRIMAL ≥ DUAL:

$\min_{w \in \mathbb{R}^d} \Big[ P(w) := \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^n \ell_i(w^T x_i) \Big] \;\ge\; \max_{\alpha \in \mathbb{R}^n} \Big[ D(\alpha) := -\frac{\lambda}{2}\|A\alpha\|^2 - \frac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) \Big], \qquad A_i = \frac{1}{\lambda n} x_i$

Stopping criterion given by the duality gap
Good performance in practice
Default in software packages, e.g. liblinear
Dual separates across machines (one dual variable per data point)
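As an illustration (my own worked instance, not on the slides): for ridge regression with squared loss $\ell_i(a) = \tfrac{1}{2}(a - y_i)^2$, the conjugate is $\ell_i^*(u) = \tfrac{1}{2}u^2 + u y_i$, so the pair above becomes

$$\min_{w}\; \frac{\lambda}{2}\|w\|^2 + \frac{1}{2n}\sum_{i=1}^n (w^T x_i - y_i)^2 \;\ge\; \max_{\alpha}\; -\frac{\lambda}{2}\|A\alpha\|^2 - \frac{1}{n}\sum_{i=1}^n \Big(\frac{\alpha_i^2}{2} - y_i \alpha_i\Big),$$

and the duality gap $P(w(\alpha)) - D(\alpha)$ with $w(\alpha) = A\alpha$ gives a suboptimality certificate that can be used as the stopping criterion.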

Pages 48-55: 1. Primal-Dual Framework

Global objective:

$\max_{\alpha \in \mathbb{R}^n} \Big[ D(\alpha) := -\frac{\lambda}{2}\|A\alpha\|^2 - \frac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) \Big], \qquad A_i = \frac{1}{\lambda n} x_i$

Local objective (on machine $k$, over its partition $P_k$):

$\max_{\Delta\alpha_{[k]} \in \mathbb{R}^n} \; -\frac{1}{n}\sum_{i \in P_k} \ell_i^*\big(-\alpha_i - (\Delta\alpha_{[k]})_i\big) \;-\; \frac{1}{n}\, w^T A \Delta\alpha_{[k]} \;-\; \frac{\lambda}{2}\Big\|\frac{1}{\lambda n} A \Delta\alpha_{[k]}\Big\|^2$

Can solve the local objective using any internal optimization method.
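A minimal single-process sketch of this scheme (my own illustration, not the released CoCoA code), using the ridge-regression instance above and a few passes of closed-form coordinate ascent as the "arbitrary" local solver. The partitioning, number of local passes, and the 1/K scaling follow the averaging variant described on the next slides; all parameter choices here are assumptions.

    import numpy as np

    def local_solver(X, y, w, alpha, part, lam, n, passes=5, rng=None):
        # Approximately solve the local subproblem on partition `part` by coordinate ascent.
        # For squared loss l_i(a) = 0.5*(a - y_i)^2 the single-coordinate maximizer is closed form.
        rng = rng or np.random.default_rng(0)
        d_alpha = np.zeros_like(alpha)
        w_local = w.copy()
        for _ in range(passes):
            for i in rng.permutation(part):
                xi = X[i]
                delta = (y[i] - xi @ w_local - (alpha[i] + d_alpha[i])) / (1.0 + xi @ xi / (lam * n))
                d_alpha[i] += delta
                w_local += xi * delta / (lam * n)     # keep the local primal image A*alpha in sync
        return d_alpha, w_local - w                   # (Delta alpha_[k], Delta w_k)

    def cocoa_v1(X, y, lam=0.1, K=4, outer_rounds=20, passes=5):
        # Outer loop: every machine solves its local subproblem, then updates are averaged:
        #   w <- w + (1/K) * sum_k Delta w_k
        n, d = X.shape
        parts = np.array_split(np.arange(n), K)       # partition data points across K machines
        alpha, w = np.zeros(n), np.zeros(d)
        for _ in range(outer_rounds):
            results = [local_solver(X, y, w, alpha, p, lam, n, passes) for p in parts]
            for d_alpha, dw in results:
                alpha += d_alpha / K                  # average the dual updates
                w += dw / K                           # reduce: average the primal updates
        return w, alpha

Each outer round needs only one reduce of a d-dimensional vector, while the amount of local work is controlled by `passes`.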

Pages 56-58: 2. Immediately Apply Updates

STALE (mini-batch): updates computed within the batch never see each other

    for i ∈ b:
        Δw ← Δw − α ∇_i P(w)
    end
    w ← w + Δw

FRESH (CoCoA): each local update is applied immediately

    for i ∈ b:
        Δw ← −α ∇_i P(w)
        w ← w + Δw
    end
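A small sketch of the difference (illustrative only; `point_grad` stands for any per-element gradient contribution and is an assumption, not notation from the talk):

    import numpy as np

    def stale_batch_update(w, batch, point_grad, alpha):
        # All gradients in the batch are evaluated at the same, stale w.
        dw = np.zeros_like(w)
        for i in batch:
            dw -= alpha * point_grad(w, i)     # never sees the other updates in the batch
        return w + dw

    def fresh_batch_update(w, batch, point_grad, alpha):
        # Each update is applied immediately, so later steps see fresher iterates.
        w = w.copy()
        for i in batch:
            w += -alpha * point_grad(w, i)     # evaluated at the already-updated w
        return w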

Pages 59-60: 3. Average over K

reduce: $w \leftarrow w + \frac{1}{K} \sum_k \Delta w_k$

Pages 61-66: CoCoA-v1 Limitations

Can we avoid having to average the partial solutions? ✔ [CoCoA+, Ma & Smith, et al., ICML '15]
L1-regularized objectives are not covered by the initial framework.

Pages 67-68: LARGE-SCALE OPTIMIZATION

CoCoA
ProxCoCoA+

Pages 69-73: L1 Regularization

Encourages sparse solutions
Includes popular models: lasso regression, sparse logistic regression, elastic net-regularized problems
Beneficial to distribute by feature
Can we map this to the CoCoA setup?

Pages 74-80: Solution: Solve the Primal Directly

PRIMAL: $\min_{\alpha \in \mathbb{R}^n} \; f(A\alpha) + \sum_{i=1}^n g_i(\alpha_i)$

DUAL: $\min_{w \in \mathbb{R}^d} \; f^*(w) + \sum_{i=1}^n \ell_i^*(-x_i^T w)$

Stopping criterion given by the duality gap
Good performance in practice
Default in software packages, e.g. glmnet
Primal separates across machines (one primal variable per feature)
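As a worked illustration (my own instance, not from the slides), the lasso fits this template directly:

$$f(A\alpha) = \tfrac{1}{2}\|A\alpha - b\|^2, \qquad g_i(\alpha_i) = \lambda\,|\alpha_i|,$$

with conjugate $f^*(w) = \tfrac{1}{2}\|w\|^2 + b^T w$, while the conjugate of the absolute-value penalty is the indicator of the box $[-\lambda, \lambda]$, so the duality gap remains computable as a stopping criterion. Each coordinate $\alpha_i$ corresponds to one feature, which is why this primal formulation separates across machines when the data is partitioned by feature.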

Pages 81-83: Results

[experimental results plots: 50x speedup]

Pages 84-94: CoCoA: A Framework for Distributed Optimization

flexible: allows for arbitrary internal methods
efficient: fast! (strong convergence & low communication)
general: works for a variety of ML models

1. CoCoA-v1 [NIPS '14]
2. CoCoA+ [ICML '15]
3. ProxCoCoA+ [current work]

CoCoA Framework

Pages 95-103: Impact & Adoption

gingsmith.github.io/cocoa/
Numerous talks & demos (e.g., at ICML)
Open-source code & documentation
Industry & academic adoption

Page 104: Thanks!

cs.berkeley.edu/~vsmith

github.com/gingsmith/proxcocoa

github.com/gingsmith/cocoa

Page 105: Empirical Results: Datasets

Dataset  | Training (n) | Features (d) | Sparsity | Workers (K)
url      | 2M           | 3M           | 4e-3%    | 4
kddb     | 19M          | 29M          | 1e-4%    | 4
epsilon  | 400K         | 2K           | 100%     | 8
webspam  | 350K         | 16M          | 0.02%    | 16

Pages 106-119: A First Approach: CoCoA+ with Smoothing

Issue: CoCoA+ requires strongly convex regularizers.
Approach: add a bit of L2 to the L1 regularizer: $\|\alpha\|_1 + \delta\|\alpha\|_2^2$

Amount of L2   | Final Sparsity
Ideal (δ = 0)  | 0.6030
δ = 0.0001     | 0.6035
δ = 0.001      | 0.6240
δ = 0.01       | 0.6465

✗ CoCoA+ with smoothing doesn't work: even small amounts of L2 degrade the final sparsity.
Additionally, CoCoA+ distributes by data point, not by feature.
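For context (my own remark, not from the slides), the smoothing trick works because adding the quadratic term makes the regularizer strongly convex, which is exactly what CoCoA+ requires:

$$\|\alpha\|_1 + \delta\|\alpha\|_2^2 \quad\text{is } 2\delta\text{-strongly convex for any } \delta > 0,$$

so CoCoA+ applies for any $\delta > 0$, but the table above shows that the resulting bias in the solution (lost sparsity) is noticeable even for small $\delta$.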

Pages 120-124: Better Solution: ProxCoCoA+

Amount of L2   | Final Sparsity
Ideal (δ = 0)  | 0.6030
δ = 0.0001     | 0.6035
δ = 0.001      | 0.6240
δ = 0.01       | 0.6465
ProxCoCoA+     | 0.6030

ProxCoCoA+ matches the ideal (δ = 0) sparsity without any smoothing.

Pages 125-128: Convergence

Assumption (Local Θ-Approximation). For $\Theta \in [0, 1)$, we assume the local solver finds an approximate solution satisfying

$$\mathbb{E}\big[\,\mathcal{G}^{\sigma'}_k(\Delta\alpha_{[k]}) - \mathcal{G}^{\sigma'}_k(\Delta\alpha^\star_{[k]})\,\big] \;\le\; \Theta\,\big(\mathcal{G}^{\sigma'}_k(0) - \mathcal{G}^{\sigma'}_k(\Delta\alpha^\star_{[k]})\big).$$

Here $\Theta = 0$ corresponds to solving each local subproblem exactly; larger $\Theta$ means cheaper but less accurate local work.

Theorem 1. Let the $g_i$ have $L$-bounded support. Then

$$T \;\ge\; O\!\Big(\frac{1}{1-\Theta}\Big(\frac{8L^2 n^2}{\tau\,\epsilon} + c\Big)\Big).$$

Theorem 2. Let the $g_i$ be $\mu$-strongly convex. Then

$$T \;\ge\; \frac{1}{1-\Theta}\,\frac{\mu\tau + n}{\mu\tau}\,\log\frac{n}{\epsilon}.$$