
Distributed Iterative Training

Kevin Gimpel Shay Cohen Severin Hacker Noah A. Smith

Outline

• The Problem

• Distributed Architecture

• Experiments and Hadoop Issues

Iterative Training

• Many problems in NLP and machine learning require iterating over large training sets many times
  – Training log-linear models (logistic regression, conditional random fields)
  – Unsupervised or semi-supervised learning with EM (word alignment in MT, grammar induction)
  – Minimum Error-Rate Training in MT
  – *Online learning (MIRA, perceptron, stochastic gradient descent)

• All of the above except * can be easily parallelized (a sketch follows this list)
  – Compute statistics on sections of the data independently
  – Aggregate them
  – Update parameters using statistics of the full set of data
  – Repeat until a stopping criterion is met
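As a rough illustration of that pattern, here is a minimal sketch using shared-memory threads in place of cluster nodes. SuffStats, Params, compute_stats, update_params, and converged are illustrative placeholders, not any particular system's API:

#include <cstddef>
#include <thread>
#include <vector>

// Illustrative placeholders (not any particular system's API): a sufficient-
// statistics accumulator, a per-shard E-step / gradient computation, a
// parameter update that uses the aggregated statistics, and a stopping test.
struct SuffStats { std::vector<double> counts; void add(const SuffStats& other); };
struct Params    { std::vector<double> values; };
SuffStats compute_stats(const std::vector<int>& shard, const Params& p);
Params    update_params(const Params& p, const SuffStats& total);
bool      converged(const Params& old_p, const Params& new_p);

void train(std::vector<std::vector<int>> shards, Params params) {
  for (bool done = false; !done; ) {
    // 1. Compute statistics on each section of the data independently.
    std::vector<SuffStats> partial(shards.size());
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < shards.size(); ++i)
      workers.emplace_back([&, i] { partial[i] = compute_stats(shards[i], params); });
    for (auto& w : workers) w.join();
    // 2. Aggregate the per-shard statistics.
    SuffStats total = partial[0];
    for (std::size_t i = 1; i < partial.size(); ++i) total.add(partial[i]);
    // 3. Update parameters using statistics of the full set of data.
    Params new_params = update_params(params, total);
    // 4. Repeat until a stopping criterion is met.
    done = converged(params, new_params);
    params = new_params;
  }
}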

Dependency Grammar Induction

• Given sentences of natural language text, infer (dependency) parse trees

• State-of-the-art results obtained using only a few thousand sentences of length ≤ 10 tokens (Smith and Eisner, 2006)

• This talk: scaling up to more and longer sentences using Hadoop!

Dependency Grammar Induction

• Training
  – Input is a set of sentences (actually, POS tag sequences) and a grammar with initial parameter values
  – Run an iterative optimization algorithm (EM, L-BFGS, etc.) that changes the parameter values on each iteration
  – Output is a learned set of parameter values

• Testing
  – Use the grammar with learned parameters to parse a small set of test sentences
  – Evaluate by computing the percentage of predicted edges that match the human annotation (see the sketch below)
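For concreteness, that evaluation is attachment accuracy: the fraction of tokens whose predicted parent matches the annotated parent. A minimal sketch, assuming each parse is stored as a vector of head indices (this representation is an assumption, not the authors' data format):

#include <cstddef>
#include <vector>

// Each parse is a vector of head indices, one per token (0 = root).
// Accuracy = predicted edges that match the human annotation / total edges.
double attachment_accuracy(const std::vector<std::vector<int>>& predicted,
                           const std::vector<std::vector<int>>& gold) {
  std::size_t correct = 0, total = 0;
  for (std::size_t s = 0; s < gold.size(); ++s)
    for (std::size_t i = 0; i < gold[s].size(); ++i, ++total)
      if (predicted[s][i] == gold[s][i]) ++correct;
  return static_cast<double>(correct) / total;
}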

Outline

• The Problem

• Distributed Architecture

• Experiments and Hadoop Issues

MapReduce for Grammar Induction

• MapReduce was designed for:
  – Large amounts of data distributed across many disks
  – Simple data processing

• We have:
  – (Relatively) small amounts of data
  – Expensive processing and high memory requirements

MapReduce for Grammar Induction

• Algorithms require 50-100 iterations for convergence
  – Each iteration requires a full sweep over all training data
  – The computational bottleneck is computing expected counts for EM (the gradient for L-BFGS) on each iteration

• Our approach: run one MapReduce job for each iteration (a map-side sketch follows below)
  – Map: compute expected counts (gradient)
  – Reduce: aggregate
  – Offline: renormalize (EM) or modify parameter values (L-BFGS)

• Note: renormalization could be done in reduce tasks for EM with correct partition functions, but using L-BFGS in multiple reduce tasks is trickier
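In a streaming setup, the map side can be as simple as reading one sentence per line and emitting one tab-separated (parameter, expected count) pair per line, leaving aggregation to the reducer. A minimal sketch, with the E-step stubbed out as a hypothetical expected_counts function (not the authors' actual C++ code):

#include <iostream>
#include <map>
#include <string>

// Hypothetical E-step: run inside-outside on one POS-tag sequence under the
// current parameters and return expected counts, keyed by parameter name.
std::map<std::string, double> expected_counts(const std::string& sentence);

int main() {
  std::string line;
  // One sentence (POS tag sequence) per input line; current parameter values
  // would be read from a file shipped via the distributed cache.
  while (std::getline(std::cin, line)) {
    for (const auto& kv : expected_counts(line))
      std::cout << kv.first << '\t' << kv.second << '\n';   // key \t value
  }
  return 0;
}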

MapReduce Implementation

[Figure: data flow for one iteration]

• Sentences (input to map tasks): [NNP, NNP, VBZ, NNP]; [DT, JJ, NN, MD, VB, JJ, NNP, CD]; [DT, NN, NN, VBZ, RB, VBN, VBN]; …
• Map: compute expected counts, e.g. p_root(NN) 0.345; p_root(NN) 1.875; p_dep(CD | NN, right) 0.175; p_dep(CD | NN, right) 0.025; p_dep(DT | NN, right) 0.065; …
• Reduce: aggregate expected counts, e.g. p_root(NN) 2.220; p_dep(CD | NN, right) 0.200; p_dep(DT | NN, right) 0.065; …
• Server:
  1. Normalize expected counts to get new parameter values, e.g. p_root(NN) = -1.91246; p_dep(CD | NN, right) = -2.7175; p_dep(DT | NN, right) = -3.0648; …
  2. Start a new MapReduce job, placing the new parameter values on the distributed cache
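Step 1 on the server is an M-step: each aggregated count is divided by the total for its distribution and stored as a log probability (which would explain the negative parameter values above). A minimal sketch, assuming parameters are keyed by name and grouped by their conditioning context; the context_of helper is a hypothetical stand-in, not the authors' code:

#include <cmath>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper: map a parameter name to the distribution it belongs to,
// e.g. "p_dep(CD | NN, right)" -> "p_dep(* | NN, right)", "p_root(NN)" -> "p_root".
std::string context_of(const std::string& param) {
  std::size_t bar = param.find('|');
  if (bar == std::string::npos) return param.substr(0, param.find('('));
  return param.substr(0, param.find('(') + 1) + "*" + param.substr(bar - 1);
}

// M-step: turn aggregated expected counts into new log-probability parameters.
std::map<std::string, double> renormalize(const std::map<std::string, double>& counts) {
  std::map<std::string, double> totals;      // distribution -> total count
  for (const auto& kv : counts) totals[context_of(kv.first)] += kv.second;

  std::map<std::string, double> log_params;  // parameter -> log probability
  for (const auto& kv : counts)
    log_params[kv.first] = std::log(kv.second / totals[context_of(kv.first)]);
  return log_params;
}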

Running Experiments

We use Hadoop streaming for all experiments, with two C++ programs: the server and the mapper (the reducer is a simple summer; a sketch follows below).
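A streaming reducer that just sums the values for each key might look like the following sketch. Hadoop streaming delivers key-sorted "key<TAB>value" lines on standard input; this is an illustration, not the authors' exact reducer:

#include <cstddef>
#include <iostream>
#include <string>

int main() {
  std::string line, current_key;
  double sum = 0.0;
  bool have_key = false;
  // Streaming reducers see "key \t value" lines, sorted by key.
  while (std::getline(std::cin, line)) {
    std::size_t tab = line.find('\t');
    std::string key = line.substr(0, tab);
    double value = std::stod(line.substr(tab + 1));
    if (have_key && key != current_key) {          // key changed: flush the sum
      std::cout << current_key << '\t' << sum << '\n';
      sum = 0.0;
    }
    current_key = key;
    sum += value;
    have_key = true;
  }
  if (have_key) std::cout << current_key << '\t' << sum << '\n';
  return 0;
}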

> cd /home/kgimpel/grammar_induction
> hod allocate -d /home/kgimpel/grammar_induction -n 25
> ./dep_induction_server \
    input_file=/user/kgimpel/data/train20-20parts \
    aux_file=aux.train20 output_file=model.train20 \
    hod_config=/home/kgimpel/grammar_induction \
    num_reduce_tasks=5 1> stdout 2> stderr

dep_induction_server runs a MapReduce job on each iteration.

The input is split into pieces for the map tasks (the dataset is too small for the default Hadoop splitter).

Outline

• The Problem

• Distributed Architecture

• Experiments and Hadoop Issues

Speed-up with Hadoop

• 38,576 sentences, ≤ 40 words per sentence
• 40 nodes, 5 reduce tasks
• Average iteration time reduced from 2039 s to 115 s
• Total time reduced from 3400 minutes to 200 minutes (roughly a 17x speed-up)

[Figure: log-likelihood (×10^6) vs. wall clock time (minutes), comparing a single node with Hadoop on 40 nodes]

Hadoop Issues

1. Overhead of running a single MapReduce job

2. Stragglers in the map phase

Typical Iteration (40 nodes, 38,576 sentences):

23:17:05 : map 0% reduce 0%
23:17:12 : map 3% reduce 0%
23:17:13 : map 26% reduce 0%
23:17:14 : map 49% reduce 0%
23:17:15 : map 66% reduce 0%
23:17:16 : map 72% reduce 0%
23:17:17 : map 97% reduce 0%
23:17:18 : map 100% reduce 0%
23:18:00 : map 100% reduce 1%
23:18:15 : map 100% reduce 2%
23:18:18 : map 100% reduce 4%
23:18:20 : map 100% reduce 15%
23:18:27 : map 100% reduce 17%
23:18:28 : map 100% reduce 18%
23:18:30 : map 100% reduce 23%
23:18:32 : map 100% reduce 100%

There is a consistent 40-second delay between the map and reduce phases.

• 115 s per iteration total
• 40+ s per iteration of overhead
• When we’re running 100 iterations per experiment, 40 seconds per iteration really adds up!
• 1/3 of execution time is overhead!


• 5 reduce tasks used
• The reduce phase is simply aggregation of values for 2600 parameters
• Why does reduce take so long?

Histogram of Iteration Times

[Figure: histogram of iteration times; x-axis: iteration time (seconds), y-axis: count. Mean ≈ 115 s]


A few iterations take much longer than the mean. What’s going on here?


Slow Iteration:

23:20:27 : map 0% reduce 0%
23:20:34 : map 5% reduce 0%
23:20:35 : map 20% reduce 0%
23:20:36 : map 41% reduce 0%
23:20:37 : map 56% reduce 0%
23:20:38 : map 74% reduce 0%
23:20:39 : map 95% reduce 0%
23:20:40 : map 97% reduce 0%
23:21:32 : map 97% reduce 1%
23:21:37 : map 97% reduce 2%
23:21:42 : map 97% reduce 12%
23:21:43 : map 97% reduce 15%
23:21:47 : map 97% reduce 19%
23:21:50 : map 97% reduce 21%
23:21:52 : map 97% reduce 26%
23:21:57 : map 97% reduce 31%
23:21:58 : map 97% reduce 32%
23:23:46 : map 100% reduce 32%
23:24:54 : map 100% reduce 46%
23:24:55 : map 100% reduce 86%
23:24:56 : map 100% reduce 100%

Compared with the typical iteration above, this run spends 3 minutes waiting for the last map tasks to complete.


Suggestions? (Doesn’t Hadoop replicate map tasks to avoid this?)

Questions?