Distributed Asynchronous Online Learning for Natural Language Processing

Kevin Gimpel, Dipanjan Das, Noah A. Smith

Transcript of the slides (48 pages); original deck: ttic.uchicago.edu/~kgimpel/talks/gimpel+das+smith.conll...

Page 1

Distributed Asynchronous Online Learning for Natural Language Processing

Kevin Gimpel, Dipanjan Das, Noah A. Smith

Page 2

Introduction

- Two recent lines of research in speeding up large learning problems:
  - Parallel/distributed computing
  - Online (and mini-batch) learning algorithms: stochastic gradient descent, perceptron, MIRA, stepwise EM
- How can we bring together the benefits of parallel computing and online learning?

Page 3

Introduction

- We use asynchronous algorithms (Nedic, Bertsekas, and Borkar, 2001; Langford, Smola, and Zinkevich, 2009)
- We apply them to structured prediction tasks:
  - Supervised learning
  - Unsupervised learning with both convex and non-convex objectives
- Asynchronous learning speeds convergence and works best with small mini-batches

Page 4

Problem Setting

- Iterative learning
- Moderate to large numbers of training examples
- Expensive inference procedures for each example
- For concreteness, we start with gradient-based optimization
- Single machine with multiple processors
  - Exploit shared memory for parameters, lexicons, feature caches, etc.
  - Maintain one master copy of model parameters

Pages 5-9

Single-Processor Batch Learning

[Diagram, built up across these slides. Notation: the dataset is D, processors are Pi, and parameters are θt; g = calc(D, θ) calculates the gradient g on data D using parameters θ, and θ1 = up(θ0, g) updates θ0 with gradient g to obtain θ1. A single processor P1 works along a timeline, alternating g = calc(D, θ0), θ1 = up(θ0, g), g = calc(D, θ1), and so on. A minimal code sketch of this loop follows.]
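To make the calc/up notation concrete, here is a minimal single-processor batch loop in Python. The gradient function grad_on and the list-of-floats parameter representation are placeholder assumptions, not the talk's implementation; only the control flow mirrors the slide.

    # Single-processor batch learning: g = calc(D, theta); theta_{t+1} = up(theta_t, g).
    def calc(D, theta, grad_on):
        """Gradient on the full dataset D at parameters theta.
        grad_on(example, theta) is a stand-in for the model's per-example gradient."""
        g = [0.0] * len(theta)
        for example in D:
            for i, gi in enumerate(grad_on(example, theta)):
                g[i] += gi
        return g

    def up(theta, g, eta=0.1):
        """Gradient-descent update: theta - eta * g."""
        return [ti - eta * gi for ti, gi in zip(theta, g)]

    def batch_learn(D, theta0, grad_on, iterations=100):
        theta = theta0
        for _ in range(iterations):
            g = calc(D, theta, grad_on)   # touches every example before each update
            theta = up(theta, g)
        return theta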

Pages 10-12

Parallel Batch Learning

- Divide data into parts, compute gradient on parts in parallel
- One processor updates parameters

[Diagram, built up across these slides. The dataset is split as D = D1 ∪ D2 ∪ D3; in each round, processors P1, P2, P3 compute g1 = calc(D1, θt), g2 = calc(D2, θt), g3 = calc(D3, θt) in parallel, the gradient is summed as g = g1 + g2 + g3, and one processor applies θt+1 = up(θt, g) before the next round begins. A code sketch follows.]
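A sketch of the same loop with the gradient computation split across worker processes, reusing the hypothetical calc and up from the previous sketch. The synchronization barrier (waiting for every shard before updating) is the point being illustrated; a real system would also need grad_on and the data shards to be picklable.

    # Parallel batch learning: g_i = calc(D_i, theta) in parallel; g = g1 + g2 + g3.
    from concurrent.futures import ProcessPoolExecutor

    def parallel_batch_learn(parts, theta0, grad_on, iterations=100, eta=0.1):
        """parts is a list of data shards [D1, D2, D3, ...]."""
        theta = theta0
        n = len(parts)
        with ProcessPoolExecutor(max_workers=n) as pool:
            for _ in range(iterations):
                # Barrier: wait for every shard's gradient before updating.
                partial = list(pool.map(calc, parts, [theta] * n, [grad_on] * n))
                g = [sum(col) for col in zip(*partial)]   # g = g1 + ... + gn
                theta = up(theta, g, eta)                  # one processor updates
        return theta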

Page 13

Parallel Synchronous Mini-Batch Learning
Finkel, Kleeman, and Manning (2008)

- Same architecture, just more frequent updates

[Diagram. Each mini-batch is split across processors as B^t = B_1^t ∪ B_2^t ∪ B_3^t; in round t, processor Pi computes g_i = c(B_i^t, θ_{t-1}), the gradient g = g1 + g2 + g3 is summed, and a single synchronous update θ_t = u(θ_{t-1}, g) is applied before the next round.]

Pages 14-21

Parallel Asynchronous Mini-Batch Learning
Nedic, Bertsekas, and Borkar (2001)

[Diagram, built up across these slides, with mini-batches Bj, processors Pi, parameters θt, and gradients gk. Each processor repeatedly takes the next mini-batch, computes its gradient with whatever parameters are current, and applies its own update as soon as it finishes: for example g1 = c(B1, θ0), g2 = c(B2, θ0), g3 = c(B3, θ0), then θ1 = u(θ0, g1), θ2 = u(θ1, g2), θ3 = u(θ2, g3), g1 = c(B4, θ1), g2 = c(B5, θ2), g3 = c(B6, θ3), θ4 = u(θ3, g1), θ5 = u(θ4, ...), and so on. There is no synchronization barrier between processors.]

- Gradients computed using stale parameters
- Increased processor utilization
  - Only idle time caused by lock for updating parameters (sketched in code below)
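A minimal sketch of the asynchronous scheme in Python, using threads and a single lock around the master copy of the parameters, so the only idle time is the brief critical section for updates. It reuses the hypothetical calc and grad_on from the earlier sketches; a real implementation at this scale would use processes or gradient code that releases the GIL.

    # Asynchronous mini-batch learning: each worker reads the current (possibly
    # stale) parameters, computes a gradient on its mini-batch, and applies its
    # own update as soon as it is done. There is no barrier between workers.
    import queue
    import threading

    def asynchronous_learn(minibatches, theta0, grad_on, num_workers=4, eta=0.1):
        theta = list(theta0)          # master copy of the parameters
        lock = threading.Lock()       # guards updates to the master copy
        work = queue.Queue()
        for B in minibatches:
            work.put(B)

        def worker():
            while True:
                try:
                    B = work.get_nowait()
                except queue.Empty:
                    return
                snapshot = list(theta)           # may be stale by the time we update
                g = calc(B, snapshot, grad_on)   # expensive part, no lock held
                with lock:                       # brief critical section
                    for i, gi in enumerate(g):
                        theta[i] -= eta * gi

        threads = [threading.Thread(target=worker) for _ in range(num_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return theta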

Page 22

Theoretical Results

- How does the use of stale parameters affect convergence?
- Convergence results exist for convex optimization using stochastic gradient descent:
  - Convergence guaranteed when max delay is bounded (Nedic, Bertsekas, and Borkar, 2001)
  - Convergence rates linear in max delay (Langford, Smola, and Zinkevich, 2009)
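For reference, one way to write the delayed update these results analyze; this is only a sketch, and the precise step-size conditions and assumptions are in the cited papers. The gradient applied at step t was computed on parameters from τ(t) steps earlier, with the delay bounded:

    \theta_{t+1} = \theta_t - \eta_t \, g\bigl(B_t,\ \theta_{t-\tau(t)}\bigr),
    \qquad 0 \le \tau(t) \le \tau_{\max},

where g(B, θ) is the gradient of the objective on mini-batch B at parameters θ, and τ_max is the maximum delay.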

Page 23

Experiments

Task                                | Method                      | Convex? | |D|  | |θ|   | m   | Model
Named-Entity Recognition            | Stochastic Gradient Descent | Y       | 15k  | 1.3M  | 4   | CRF
Word Alignment                      | Stepwise EM                 | Y       | 300k | 14.2M | 10k | IBM Model 1
Unsupervised Part-of-Speech Tagging | Stepwise EM                 | N       | 42k  | 2M    | 4   | HMM

- To compare algorithms, we use wall clock time (with a dedicated 4-processor machine)
- m = mini-batch size

Page 24

Experiments

Task                     | Method                      | Convex? | |D| | |θ|  | m | Model
Named-Entity Recognition | Stochastic Gradient Descent | Y       | 15k | 1.3M | 4 | CRF

- CoNLL 2003 English data
- Label each token with entity type (person, location, organization, or miscellaneous) or non-entity
- We show convergence in F1 on development data

Page 25

Asynchronous Updating Speeds Convergence

[Plot: F1 on development data vs. wall clock time (hours) for Asynchronous (4 processors), Synchronous (4 processors), and Single-processor. All use a mini-batch size of 4.]

Page 26

Comparison with Ideal Speed-up

[Plot: F1 vs. wall clock time (hours) for Asynchronous (4 processors) and Ideal.]

Page 27

Why Does Asynchronous Converge Faster?

- Processors are kept in near-constant use
- Synchronous SGD leads to idle processors → need for load-balancing

Page 28

[Two plots: F1 vs. wall clock time (hours). One panel compares Synchronous (4 processors), Synchronous (2 processors), and Single-processor; the other compares Asynchronous (4 processors), Asynchronous (2 processors), and Single-processor.]

Clearer improvement for asynchronous algorithms when increasing the number of processors

Page 29

Artificial Delays

[Plot: F1 vs. wall clock time (hours) for Asynchronous with no delay, Asynchronous with µ = 5, µ = 10, and µ = 20, and Single-processor with no delay.]

- After completing a mini-batch, 25% chance of delaying (sketched in code below)
- Delay (in seconds) sampled from max(N(µ, (µ/5)²), 0)
- Avg. time per mini-batch = 0.62 s
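A sketch of the delay injection described above. The sampling rule (25% chance, max(N(µ, (µ/5)²), 0) seconds) is from the slide; the function name and where the worker calls it are assumptions.

    import random
    import time

    def maybe_delay(mu, p=0.25):
        """Call after completing a mini-batch: with probability p, sleep for
        max(N(mu, (mu/5)^2), 0) seconds to simulate a slow processor."""
        if mu > 0 and random.random() < p:
            time.sleep(max(random.gauss(mu, mu / 5.0), 0.0))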

Page 30

Experiments

Task           | Method      | Convex? | |D|  | |θ|   | m   | Model
Word Alignment | Stepwise EM | Y       | 300k | 14.2M | 10k | IBM Model 1

- Given parallel sentences, draw links between words:

    konnten sie es übersetzen ?
    could you translate it ?

- We show convergence in log-likelihood (convergence in AER is similar)

Page 31

Stepwise EM
(Sato and Ishii, 2000; Cappe and Moulines, 2009)

- Similar to stochastic gradient descent in the space of sufficient statistics, with a particular scaling of the update (written out below)
- More efficient than incremental EM (Neal and Hinton, 1998)
- Found to converge much faster than batch EM (Liang and Klein, 2009)
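One common form of the stepwise EM update, following Cappé and Moulines (2009) and Liang and Klein (2009); the slide does not spell out the exact stepsize schedule, so the one below is illustrative. The running sufficient statistics s are interpolated toward the expected statistics of each mini-batch B_k, and the parameters are re-estimated from s:

    s^{(k+1)} = (1 - \eta_k)\, s^{(k)} + \eta_k\, \hat{s}\bigl(B_k;\ \theta(s^{(k)})\bigr),
    \qquad \eta_k = (k + 2)^{-\alpha},\ \ \alpha \in (0.5, 1],

where ŝ(B; θ) denotes the E-step's expected sufficient statistics on mini-batch B under parameters θ, and θ(s) is the M-step estimate computed from statistics s.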

Pages 32-34

Word Alignment Results

[Plot, shown across these slides: log-likelihood vs. wall clock time (minutes) for Asynch. Stepwise EM (4 processors), Synch. Stepwise EM (4 processors), Synch. Stepwise EM (1 processor), and Batch EM (1 processor). For stepwise EM, mini-batch size = 10,000.]

Asynchronous is no faster than synchronous!

Pages 35-37

Comparing Mini-Batch Sizes

[Plot, shown across these slides: log-likelihood vs. wall clock time (minutes) for asynchronous and synchronous stepwise EM with mini-batch sizes m = 10,000, m = 1,000, and m = 100.]

- Asynchronous is faster when using small mini-batches
- Annotation on the plot: "Error from asynchronous updating"

Page 38

Word Alignment Results

[Plot repeated from slide 32: log-likelihood vs. wall clock time (minutes); for stepwise EM, mini-batch size = 10,000.]

Page 39

Comparison with Ideal Speed-up

[Plot: log-likelihood vs. wall clock time (minutes) for Asynch. Stepwise EM (4 processors), Synch. Stepwise EM (4 processors), Synch. Stepwise EM (1 processor), Batch EM (1 processor), and Ideal. For stepwise EM, mini-batch size = 10,000.]

Page 40

MapReduce?

- We also ran these algorithms on a large MapReduce cluster (M45 from Yahoo!)
- Batch EM
  - Each iteration is one MapReduce job, using 24 mappers and 1 reducer
- Asynchronous Stepwise EM (controller sketched below)
  - 4 mini-batches processed simultaneously, each run as a MapReduce job
  - Each uses 6 mappers and 1 reducer
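A sketch of what such an asynchronous controller might look like, not the talk's actual driver code. submit_alignment_job, merge, and reestimate are hypothetical callbacks: submit_alignment_job(batch, theta) blocks until its MapReduce job finishes and returns expected sufficient statistics, merge folds them into the running statistics with a stepwise weight, and reestimate produces the new model. Four jobs run at once, and a new one is launched as soon as any job finishes.

    from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

    def async_stepwise_em_mapreduce(minibatches, stats0, submit_alignment_job,
                                    merge, reestimate, k_parallel=4, alpha=0.7):
        stats = stats0
        theta = reestimate(stats)
        k = 0
        batches = iter(minibatches)
        with ThreadPoolExecutor(max_workers=k_parallel) as pool:
            running = set()
            for _ in range(k_parallel):
                running.add(pool.submit(submit_alignment_job, next(batches), theta))
            while running:
                done, running = wait(running, return_when=FIRST_COMPLETED)
                for finished in done:
                    eta = (k + 2) ** -alpha          # illustrative stepwise weight
                    stats = merge(stats, finished.result(), eta)
                    theta = reestimate(stats)        # updated master model
                    k += 1
                    try:
                        running.add(pool.submit(submit_alignment_job,
                                                next(batches), theta))
                    except StopIteration:
                        pass                         # no more mini-batches to launch
        return theta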

Pages 41-42

MapReduce?

[Two plots: log-likelihood vs. wall clock time (minutes). One panel repeats the shared-memory results (Asynch. Stepwise EM (4 processors), Synch. Stepwise EM (4 processors), Synch. Stepwise EM (1 processor), Batch EM (1 processor)); the other shows Asynch. Stepwise EM (MapReduce) and Batch EM (MapReduce).]

Page 43

Experiments

Task                                | Method      | Convex? | |D| | |θ| | m | Model
Unsupervised Part-of-Speech Tagging | Stepwise EM | N       | 42k | 2M  | 4 | HMM

- Bigram HMM with 45 states
- We plot convergence in likelihood and many-to-1 accuracy (sketched in code below)
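Many-to-1 accuracy scores an unsupervised tagger by mapping each induced HMM state to the gold tag it co-occurs with most often, then computing tagging accuracy under that mapping. A small sketch of this metric, assuming parallel lists of predicted states and gold tags; the talk's own evaluation code is not shown.

    from collections import Counter, defaultdict

    def many_to_one_accuracy(pred_states, gold_tags):
        """pred_states, gold_tags: parallel per-token label sequences."""
        cooc = defaultdict(Counter)
        for state, tag in zip(pred_states, gold_tags):
            cooc[state][tag] += 1
        # Map each induced state to its most frequent gold tag; several states
        # may map to the same tag, hence "many-to-1".
        mapping = {state: counts.most_common(1)[0][0]
                   for state, counts in cooc.items()}
        correct = sum(mapping[s] == t for s, t in zip(pred_states, gold_tags))
        return correct / len(gold_tags)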

Page 44

Part-of-Speech Tagging Results

[Two plots: log-likelihood (×10^6) and many-to-1 accuracy (%) vs. wall clock time (hours) for Asynch. Stepwise EM (4 processors), Synch. Stepwise EM (4 processors), Synch. Stepwise EM (1 processor), and Batch EM (1 processor). Mini-batch size = 4 for stepwise EM.]

Pages 45-46

Comparison with Ideal

[Two plots: log-likelihood (×10^6) and many-to-1 accuracy (%) vs. wall clock time (hours) for Asynch. Stepwise EM (4 processors) and Ideal.]

Asynchronous better than ideal?

Page 47

Conclusions and Future Work

- Asynchronous algorithms speed convergence and do not introduce additional error
- Effective for unsupervised learning and non-convex objectives
- If your problem works well with small mini-batches, try this!
- Future work
  - Theoretical results for the non-convex case
  - Explore effects of increasing the number of processors
  - New architectures (maintain multiple copies of θ)

Page 48

Thanks!