Distributed Asynchronous Online Learning for Natural Language Processing
Kevin Gimpel, Dipanjan Das, Noah A. Smith
Transcript of slides: ttic.uchicago.edu/~kgimpel/talks/gimpel+das+smith.conll...
Introduction
- Two recent lines of research in speeding up large learning problems:
  - Parallel/distributed computing
  - Online (and mini-batch) learning algorithms: stochastic gradient descent, perceptron, MIRA, stepwise EM
- How can we bring together the benefits of parallel computing and online learning?
- We use asynchronous algorithms (Nedic, Bertsekas, and Borkar, 2001; Langford, Smola, and Zinkevich, 2009)
- We apply them to structured prediction tasks:
  - Supervised learning
  - Unsupervised learning with both convex and non-convex objectives
- Asynchronous learning speeds convergence and works best with small mini-batches
Problem Setting
- Iterative learning
- Moderate to large numbers of training examples
- Expensive inference procedures for each example
- For concreteness, we start with gradient-based optimization
- Single machine with multiple processors
  - Exploit shared memory for parameters, lexicons, feature caches, etc.
  - Maintain one master copy of model parameters
Single-Processor Batch Learning

Dataset: D; Processors: Pi; Parameters: θt
- g = calc(D, θ): calculate gradient g on data D using parameters θ
- θ1 = up(θ0, g): update using gradient g to obtain θ1
- P1 repeats over time: g = calc(D, θ0); θ1 = up(θ0, g); g = calc(D, θ1); ...
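In code, this loop is just "compute the gradient on all of D, then update." A minimal sketch on a least-squares stand-in (the calc/up names come from the slide; the data, step size, and objective are invented for illustration):

```python
import numpy as np

def calc(D, theta):
    # Gradient of 0.5 * ||X theta - y||^2 over the full dataset D
    X, y = D
    return X.T @ (X @ theta - y)

def up(theta, g, eta=0.01):
    # Gradient-descent update: theta_{t+1} = theta_t - eta * g
    return theta - eta * g

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta
D = (X, y)

theta = np.zeros(3)
for t in range(200):
    g = calc(D, theta)      # g = calc(D, theta_t): one full pass over D
    theta = up(theta, g)    # theta_{t+1} = up(theta_t, g)

print(theta)  # approaches [1.0, -2.0, 0.5]
```

Every iteration touches the whole dataset, which is what the parallel and online variants on the following slides try to avoid.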
Parallel Batch Learning

Dataset: D = D1 ∪ D2 ∪ D3; Processors: Pi; Parameters: θt
- Divide data into parts, compute gradient on parts in parallel: gi = calc(Di, θ0)
- Combined gradient: g = g1 + g2 + g3
- One processor updates parameters: θ1 = up(θ0, g); then all compute gi = calc(Di, θ1), and so on
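This parallelization is exact because the objective is a sum over examples, so gradients over a disjoint partition add up to the full gradient. A sketch checking that identity (same invented least-squares stand-in; real code would run calc on three processors rather than in a loop):

```python
import numpy as np

def calc(D, theta):
    # Gradient of 0.5 * ||X theta - y||^2 on (a part of) the data
    X, y = D
    return X.T @ (X @ theta - y)

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = rng.normal(size=60)
theta = rng.normal(size=4)

# Split D into three parts D1, D2, D3, as each processor Pi would hold
parts = [(X[i:i+20], y[i:i+20]) for i in range(0, 60, 20)]

# Each Pi computes gi = calc(Di, theta); one processor sums them
g = sum(calc(Di, theta) for Di in parts)

# g1 + g2 + g3 equals the gradient over all of D
assert np.allclose(g, calc((X, y), theta))
print("partial gradients sum to the full gradient")
```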
Parallel Synchronous Mini-Batch Learning (Finkel, Kleeman, and Manning, 2008)

Mini-batches: Bt = B1t ∪ B2t ∪ B3t; Processors: Pi; Parameters: θt
- Same architecture, just more frequent updates
- In each round, every processor Pi computes gi = c(Bit, θt) on its share of the mini-batch; one processor then sets θt+1 = u(θt, g1 + g2 + g3)
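A sketch of this synchronous scheme with threads (least-squares stand-in; the c/u names come from the slide, everything else is invented). The key property is the implicit barrier: each round waits for all three partial gradients before the single update.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def c(B, theta):
    # Partial gradient on one processor's share of the mini-batch
    X, y = B
    return X.T @ (X @ theta - y)

def u(theta, g, eta=0.01):
    # One processor applies the combined update
    return theta - eta * g

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta

theta = np.zeros(3)
with ThreadPoolExecutor(max_workers=3) as pool:
    for epoch in range(20):
        for s in range(0, 300, 30):                  # mini-batch B_t of size 30
            Xb, yb = X[s:s+30], y[s:s+30]
            # Split B_t into B_t^1, B_t^2, B_t^3 for processors P1, P2, P3
            shares = [(Xb[i:i+10], yb[i:i+10]) for i in range(0, 30, 10)]
            gs = list(pool.map(c, shares, [theta] * 3))  # barrier: wait for all gi
            theta = u(theta, sum(gs))                    # g = g1 + g2 + g3
print(theta)  # approaches [2.0, -1.0, 0.5]
```

If one share takes longer than the others, the remaining processors sit idle at the barrier, which motivates the asynchronous variant on the next slide.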
Parallel Asynchronous Mini-Batch Learning (Nedic, Bertsekas, and Borkar, 2001)

Mini-batches: Bj; Processors: Pi; Parameters: θt; Gradients: gk
- Each processor takes the next mini-batch and the current parameters, computes its gradient, and updates as soon as it finishes:
  P1: g1 = c(B1, θ0); θ1 = u(θ0, g1); g1 = c(B4, θ1); θ4 = u(θ3, g1); ...
  P2: g2 = c(B2, θ0); θ2 = u(θ1, g2); g2 = c(B5, θ2); ...
  P3: g3 = c(B3, θ0); θ3 = u(θ2, g3); g3 = c(B6, θ3); ...
- Gradients computed using stale parameters
- Increased processor utilization: the only idle time is caused by the lock for updating parameters
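A minimal sketch of the asynchronous scheme with Python threads (least-squares stand-in; only the update takes the lock, so gradient computations may read parameters that are already stale — the data, step size, and all names other than c are invented):

```python
import threading, queue
import numpy as np

def c(B, theta):
    # Gradient on one mini-batch
    X, y = B
    return X.T @ (X @ theta - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
true_theta = np.array([1.0, 2.0, -0.5])
y = X @ true_theta

theta = np.zeros(3)          # one master copy of the parameters
lock = threading.Lock()      # the only synchronization: the parameter update
work = queue.Queue()
for _ in range(5):                       # five passes over the data
    for i in range(0, 400, 4):           # mini-batches B_j of size 4
        work.put((X[i:i+4], y[i:i+4]))

def processor():
    global theta
    while True:
        try:
            B = work.get_nowait()
        except queue.Empty:
            return
        g = c(B, theta)      # no lock here: the parameters read may be stale
        with lock:           # theta_{t+1} = u(theta_t, g)
            theta = theta - 0.01 * g

threads = [threading.Thread(target=processor) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(theta)  # converges close to [1.0, 2.0, -0.5] despite stale reads
```

Because processors never wait at a barrier, they stay busy; the price is that each update may be based on parameters a few updates old.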
Theoretical Results
- How does the use of stale parameters affect convergence?
- Convergence results exist for convex optimization using stochastic gradient descent:
  - Convergence guaranteed when the maximum delay is bounded (Nedic, Bertsekas, and Borkar, 2001)
  - Convergence rates linear in the maximum delay (Langford, Smola, and Zinkevich, 2009)
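Schematically (our notation, not a reproduction of either paper's), the asynchronous update applies a gradient evaluated at parameters that are τ(t) steps stale, and the results above require that delay to be bounded:

```latex
\theta_{t+1} \;=\; \theta_t \;-\; \eta_t \, g\!\bigl(\theta_{\,t-\tau(t)}\bigr),
\qquad 0 \le \tau(t) \le \tau_{\max}
```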
Experiments

- Named-Entity Recognition: Stochastic Gradient Descent; convex; |D| = 15k; |θ| = 1.3M; model: CRF; m = 4
- Word Alignment: Stepwise EM; convex; |D| = 300k; |θ| = 14.2M; model: IBM Model 1; m = 10k
- Unsupervised Part-of-Speech Tagging: Stepwise EM; non-convex; |D| = 42k; |θ| = 2M; model: HMM; m = 4

- To compare algorithms, we use wall clock time (with a dedicated 4-processor machine)
- m = mini-batch size
Experiments: Named-Entity Recognition
(Stochastic Gradient Descent; convex; |D| = 15k; |θ| = 1.3M; model: CRF; m = 4)
- CoNLL 2003 English data
- Label each token with an entity type (person, location, organization, or miscellaneous) or non-entity
- We show convergence in F1 on development data
Asynchronous Updating Speeds Convergence
[Plot: F1 (84-91) vs. wall clock time (0-12 hours) for Asynchronous (4 processors), Synchronous (4 processors), and Single-processor; all use a mini-batch size of 4]
Comparison with Ideal Speed-up
[Plot: F1 (84-91) vs. wall clock time (0-12 hours) for Asynchronous (4 processors) and Ideal]
Why Does Asynchronous Converge Faster?
- Processors are kept in near-constant use
- Synchronous SGD leads to idle processors → need for load-balancing
[Plots: F1 (85-91) vs. wall clock time (0-8 hours); top: Synchronous (4 processors), Synchronous (2 processors), Single-processor; bottom: Asynchronous (4 processors), Asynchronous (2 processors), Single-processor]
Clearer improvement for asynchronous algorithms when increasing the number of processors
Artificial Delays
[Plot: F1 (85-91) vs. wall clock time (0-12 hours) for Asynchronous with no delay, µ = 5, µ = 10, µ = 20, and Single-processor with no delay]
- After completing a mini-batch, 25% chance of delaying
- Delay (in seconds) sampled from max(N(µ, (µ/5)²), 0)
- Avg. time per mini-batch = 0.62 s
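The delay protocol on this slide is easy to restate in code; a sketch (the function name is invented):

```python
import random

def maybe_delay_seconds(mu, rng=random):
    """After finishing a mini-batch: with probability 0.25, return a delay
    (in seconds) drawn from max(N(mu, (mu/5)^2), 0); otherwise return 0."""
    if rng.random() >= 0.25:
        return 0.0
    return max(rng.gauss(mu, mu / 5), 0.0)

random.seed(0)
delays = [maybe_delay_seconds(5.0) for _ in range(10000)]
frac_delayed = sum(d > 0 for d in delays) / len(delays)
print(frac_delayed)  # close to 0.25
```

With µ = 5 and an average of 0.62 s per mini-batch, a triggered delay stalls a processor for roughly eight mini-batches' worth of work.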
Experiments: Word Alignment
(Stepwise EM; convex; |D| = 300k; |θ| = 14.2M; model: IBM Model 1; m = 10k)
- Given parallel sentences, draw links between words:
  konnten sie es übersetzen ?
  could you translate it ?
- We show convergence in log-likelihood (convergence in AER is similar)
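IBM Model 1 learns word-translation probabilities t(e | f) with EM. A minimal batch-EM sketch on an invented toy corpus (the talk's actual setup trains on 300k sentence pairs with stepwise EM; this only shows the model being estimated):

```python
from collections import defaultdict

# Toy parallel corpus (invented for illustration)
pairs = [("das haus", "the house"),
         ("das buch", "the book"),
         ("ein buch", "a book")]
pairs = [(f.split(), e.split()) for f, e in pairs]

e_vocab = {e for _, E in pairs for e in E}
t = defaultdict(lambda: 1.0 / len(e_vocab))   # t(e|f), uniform initialization

for _ in range(10):                            # batch EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    for F, E in pairs:                         # E-step: expected alignment counts
        for e in E:
            z = sum(t[(f, e)] for f in F)      # normalize over source positions
            for f in F:
                post = t[(f, e)] / z
                count[(f, e)] += post
                total[f] += post
    for (f, e), n in count.items():            # M-step: renormalize t(e|f)
        t[(f, e)] = n / total[f]

print(round(t[("haus", "house")], 2))  # "haus" now prefers "house" over "the"
```

Even though "das" co-occurs with both "the house" and "the book", EM uses that overlap to pull "the" toward "das", leaving "house" and "book" for "haus" and "buch".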
Stepwise EM (Sato and Ishii, 2000; Cappe and Moulines, 2009)
- Similar to stochastic gradient descent in the space of sufficient statistics, with a particular scaling of the update
- More efficient than incremental EM (Neal and Hinton, 1998)
- Found to converge much faster than batch EM (Liang and Klein, 2009)
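The stepwise update interpolates the running sufficient statistics with those computed on one mini-batch. A schematic sketch (the step-size schedule ηk = (k + 2)^(−α) follows Liang and Klein, 2009; the function names and the latent-variable-free demo are our own):

```python
import numpy as np

def stepwise_em(minibatches, compute_stats, theta_from_stats, alpha=0.7):
    """Schematic stepwise EM: s <- (1 - eta_k) * s + eta_k * s_hat(B_k, theta)."""
    s = None
    theta = None
    for k, B in enumerate(minibatches):
        s_hat = compute_stats(B, theta)       # E-step on one mini-batch
        eta = (k + 2) ** (-alpha)             # step size, 0.5 < alpha <= 1
        s = s_hat if s is None else (1 - eta) * s + eta * s_hat
        theta = theta_from_stats(s)           # M-step from interpolated stats
    return theta

# Tiny demo with no latent variables, just to exercise the interpolation:
# estimate a Gaussian mean, whose sufficient statistic is the sample mean.
rng = np.random.default_rng(0)
data = rng.normal(3.0, 1.0, size=(500, 20))   # 500 mini-batches of 20 points
mu = stepwise_em(data,
                 compute_stats=lambda B, th: B.mean(),
                 theta_from_stats=lambda s: s)
print(mu)  # close to 3.0
```

In the talk's models, compute_stats would be an inference step (posterior expected counts) and theta_from_stats a count normalization, but the interpolation of statistics is the same.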
Word Alignment Results
[Plot: log-likelihood (-40 to -20) vs. wall clock time (10-80 minutes) for Asynch. Stepwise EM (4 processors), Synch. Stepwise EM (4 processors), Synch. Stepwise EM (1 processor), and Batch EM (1 processor)]
For stepwise EM, mini-batch size = 10,000
Asynchronous is no faster than synchronous!
Comparing Mini-Batch Sizes
[Plot: log-likelihood (-32 to -20) vs. wall clock time (10-100 minutes) for Asynch. and Synch. stepwise EM with m = 10,000, m = 1,000, and m = 100]
Asynchronous is faster when using small mini-batches
(Plot annotation: error from asynchronous updating)
Comparison with Ideal Speed-up
[Plot: log-likelihood (-40 to -20) vs. wall clock time (10-80 minutes) for Asynch. Stepwise EM (4 processors), Synch. Stepwise EM (4 processors), Synch. Stepwise EM (1 processor), Batch EM (1 processor), and Ideal]
For stepwise EM, mini-batch size = 10,000
MapReduce?
- We also ran these algorithms on a large MapReduce cluster (M45 from Yahoo!)
- Batch EM: each iteration is one MapReduce job, using 24 mappers and 1 reducer
- Asynchronous Stepwise EM: 4 mini-batches processed simultaneously, each run as a MapReduce job; each uses 6 mappers and 1 reducer
MapReduce?
[Plots: log-likelihood (-40 to -20) vs. wall clock time (10-80 minutes); left: Asynch. Stepwise EM (4 processors), Synch. Stepwise EM (4 processors), Synch. Stepwise EM (1 processor), Batch EM (1 processor); right: Asynch. Stepwise EM (MapReduce), Batch EM (MapReduce)]
Experiments: Unsupervised Part-of-Speech Tagging
(Stepwise EM; non-convex; |D| = 42k; |θ| = 2M; model: HMM; m = 4)
- Bigram HMM with 45 states
- We plot convergence in likelihood and many-to-1 accuracy
Part-of-Speech Tagging Results
[Plots: log-likelihood (-7.5 to -6, ×10⁶) and accuracy (50-65%) vs. wall clock time (0-6 hours) for Asynch. Stepwise EM (4 processors), Synch. Stepwise EM (4 processors), Synch. Stepwise EM (1 processor), and Batch EM (1 processor)]
Mini-batch size = 4 for stepwise EM
Comparison with Ideal
[Plots: log-likelihood and accuracy vs. wall clock time (0-6 hours) for Asynch. Stepwise EM (4 processors) and Ideal]
Asynchronous better than ideal?
Conclusions and Future Work
- Asynchronous algorithms speed convergence and do not introduce additional error
- Effective for unsupervised learning and non-convex objectives
- If your problem works well with small mini-batches, try this!
- Future work:
  - Theoretical results for the non-convex case
  - Explore effects of increasing the number of processors
  - New architectures (maintain multiple copies of θ)
Thanks!