Transcript of Distributed Iterative Training (Kevin Gimpel, Shay Cohen, Severin Hacker, Noah A. Smith)

Page 1

Distributed Iterative Training

Kevin Gimpel Shay Cohen Severin Hacker Noah A. Smith

Page 2

Outline

• The Problem

• Distributed Architecture

• Experiments and Hadoop Issues

Page 3

Iterative Training

• Many problems in NLP and machine learning require iterating over large training sets many times
  – Training log-linear models (logistic regression, conditional random fields)
  – Unsupervised or semi-supervised learning with EM (word alignment in MT, grammar induction)
  – Minimum Error-Rate Training in MT
  – *Online learning (MIRA, perceptron, stochastic gradient descent)

• All of the above except * can be easily parallelized (as sketched after this list):
  – Compute statistics on sections of the data independently
  – Aggregate them
  – Update parameters using the statistics of the full data set
  – Repeat until a stopping criterion is met
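To make the compute-aggregate-update loop concrete, here is a minimal single-machine sketch of the pattern in C++. All names (computeStats, updateParams, the toy shards) are hypothetical placeholders, not the talk's actual code; in the distributed setting described later, step 1 becomes the map phase and step 2 the reduce phase.

    #include <functional>
    #include <future>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    using Params = std::map<std::string, double>;
    using Stats  = std::map<std::string, double>;  // e.g. expected counts

    // Hypothetical: compute sufficient statistics on one shard of the data.
    Stats computeStats(const std::vector<std::string>& shard, const Params& params) {
      (void)params;                      // real code would run inference with the parameters
      Stats s;
      for (const auto& sentence : shard) {
        (void)sentence;
        s["p_root(NN)"] += 0.1;          // placeholder contribution
      }
      return s;
    }

    // Hypothetical: turn aggregated statistics into new parameter values
    // (renormalization for EM, or a gradient-based update for LBFGS/SGD).
    Params updateParams(const Params& old, const Stats& total) {
      Params updated = old;
      for (const auto& kv : total) updated[kv.first] = kv.second;  // placeholder update
      return updated;
    }

    int main() {
      std::vector<std::vector<std::string>> shards = {
          {"NNP NNP VBZ NNP"}, {"DT JJ NN MD VB JJ NNP CD"}};  // toy data shards
      Params params = {{"p_root(NN)", -1.9}};

      for (int iter = 0; iter < 100; ++iter) {  // fixed cap; real code checks convergence
        // 1. Compute statistics on each shard independently (the "map" step).
        std::vector<std::future<Stats>> futures;
        for (const auto& shard : shards)
          futures.push_back(std::async(std::launch::async, computeStats,
                                       std::cref(shard), std::cref(params)));
        // 2. Aggregate the per-shard statistics (the "reduce" step).
        Stats total;
        for (auto& f : futures)
          for (const auto& kv : f.get()) total[kv.first] += kv.second;
        // 3. Update parameters using the statistics of the full data set.
        params = updateParams(params, total);
      }
      std::cout << "p_root(NN) = " << params["p_root(NN)"] << "\n";
      return 0;
    }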

Page 4

Dependency Grammar Induction

• Given sentences of natural language text, infer (dependency) parse trees

• State-of-the-art results obtained using only a few thousand sentences of length ≤ 10 tokens (Smith and Eisner, 2006)

• This talk: scaling up to more and longer sentences using Hadoop!

Page 5

Dependency Grammar Induction

• Training
  – Input is a set of sentences (actually, POS tag sequences) and a grammar with initial parameter values
  – Run an iterative optimization algorithm (EM, LBFGS, etc.) that changes the parameter values on each iteration
  – Output is a learned set of parameter values

• Testing
  – Use the grammar with learned parameters to parse a small set of test sentences
  – Evaluate by computing the percentage of predicted edges that match a human annotator (a small sketch of this metric follows)
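The metric is attachment accuracy: the fraction of predicted dependency edges (equivalently, per-token head attachments) that agree with the human annotation. A minimal sketch, assuming a hypothetical representation of a parse as a vector of head indices (not the talk's actual data structures):

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // A parse of an n-token sentence is a vector of n head indices
    // (0 = root, i = the i-th token); this representation is an assumption.
    using Parse = std::vector<int>;

    // Fraction of predicted heads that match the gold heads, over all tokens.
    double attachmentAccuracy(const std::vector<Parse>& predicted,
                              const std::vector<Parse>& gold) {
      std::size_t correct = 0, total = 0;
      for (std::size_t s = 0; s < predicted.size(); ++s) {
        for (std::size_t i = 0; i < predicted[s].size(); ++i) {
          total += 1;
          if (predicted[s][i] == gold[s][i]) correct += 1;
        }
      }
      return total == 0 ? 0.0 : static_cast<double>(correct) / total;
    }

    int main() {
      // Toy example: one 4-token sentence, 3 of 4 heads predicted correctly.
      std::vector<Parse> predicted = {{2, 0, 2, 3}};
      std::vector<Parse> gold      = {{2, 0, 2, 2}};
      std::cout << attachmentAccuracy(predicted, gold) << "\n";  // prints 0.75
      return 0;
    }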

Page 6

Outline

• The Problem

• Distributed Architecture

• Experiments and Hadoop Issues

Page 7

MapReduce for Grammar Induction

• MapReduce was designed for:
  – Large amounts of data distributed across many disks
  – Simple data processing

• We have:
  – (Relatively) small amounts of data
  – Expensive processing and high memory requirements

Page 8

MapReduce for Grammar Induction

• Algorithms require 50-100 iterations for convergence
  – Each iteration requires a full sweep over all training data
  – The computational bottleneck is computing expected counts for EM (the gradient for LBFGS) on each iteration

• Our approach: run one MapReduce job for each iteration
  – Map: compute expected counts (gradient)
  – Reduce: aggregate (see the reducer sketch after this list)
  – Offline: renormalize (EM) or modify parameter values (LBFGS)

• Note: renormalization could be done in the reduce tasks for EM given the correct partition functions, but using LBFGS across multiple reduce tasks is trickier
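Because the map tasks emit expected counts as key/value pairs and the reduce step only aggregates them, the reducer can be the "simple summer" mentioned later (Page 10). A minimal Hadoop-streaming-style sketch in C++; the I/O format (tab-separated "parameter<TAB>count" lines on stdin, grouped by key) is an assumption based on streaming defaults, not the talk's actual program:

    #include <iostream>
    #include <sstream>
    #include <string>

    // Streaming-style reducer: sums the values for each key.
    // Assumes input lines of the form "key<TAB>value", grouped/sorted by key.
    int main() {
      std::string line, currentKey;
      double sum = 0.0;
      bool haveKey = false;

      while (std::getline(std::cin, line)) {
        std::istringstream in(line);
        std::string key;
        double value = 0.0;
        if (!std::getline(in, key, '\t') || !(in >> value)) continue;  // skip malformed lines

        if (haveKey && key != currentKey) {
          std::cout << currentKey << '\t' << sum << '\n';  // emit aggregated count
          sum = 0.0;
        }
        currentKey = key;
        haveKey = true;
        sum += value;
      }
      if (haveKey) std::cout << currentKey << '\t' << sum << '\n';
      return 0;
    }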

Page 9

MapReduce Implementation

[Diagram: data flow for one training iteration]

• Input sentences (POS tag sequences):
  [NNP,NNP,VBZ,NNP]
  [DT,JJ,NN,MD,VB,JJ,NNP,CD]
  [DT,NN,NN,VBZ,RB,VBN,VBN]
  …

• Map: compute expected counts, e.g.:
  p_root(NN) 0.345
  p_root(NN) 1.875
  p_dep(CD | NN, right) 0.175
  p_dep(CD | NN, right) 0.025
  p_dep(DT | NN, right) 0.065
  …

• Reduce: aggregate expected counts, e.g.:
  p_root(NN) 2.220
  p_dep(CD | NN, right) 0.200
  p_dep(DT | NN, right) 0.065
  …

• Server:
  1. Normalize expected counts to get new parameter values
  2. Start a new MapReduce job, placing the new parameter values on the distributed cache

• Distributed cache: new parameter values, e.g.:
  p_root(NN) = -1.91246
  p_dep(CD | NN, right) = -2.7175
  p_dep(DT | NN, right) = -3.0648
  …

(A sketch of the server's normalization step follows.)
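The server's step 1 is the EM M-step: within each multinomial distribution, divide each aggregated expected count by the distribution's total, storing the result in log space (consistent with the negative parameter values shown above). A minimal sketch, where the grouping function distributionOf() is a crude hypothetical placeholder and the toy counts are only illustrative; the real grouping depends on the grammar:

    #include <cmath>
    #include <iostream>
    #include <map>
    #include <string>

    // Aggregated expected counts keyed by parameter name, e.g. "p_dep(CD | NN, right)".
    using Counts = std::map<std::string, double>;

    // Hypothetical grouping function: returns the multinomial distribution
    // (conditioning context) a parameter belongs to.
    std::string distributionOf(const std::string& param) {
      auto bar = param.find('|');
      return bar == std::string::npos ? "root" : param.substr(bar);  // crude placeholder
    }

    // M-step: normalize counts within each distribution; return log-probabilities.
    std::map<std::string, double> normalize(const Counts& counts) {
      std::map<std::string, double> totals;
      for (const auto& kv : counts) totals[distributionOf(kv.first)] += kv.second;

      std::map<std::string, double> logProbs;
      for (const auto& kv : counts)
        logProbs[kv.first] = std::log(kv.second / totals[distributionOf(kv.first)]);
      return logProbs;
    }

    int main() {
      Counts counts = {{"p_root(NN)", 2.220},
                       {"p_dep(CD | NN, right)", 0.200},
                       {"p_dep(DT | NN, right)", 0.065}};  // toy aggregated counts
      for (const auto& kv : normalize(counts))
        std::cout << kv.first << " = " << kv.second << "\n";
      return 0;
    }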

Page 10

Running Experiments

We use streaming for all experiments, with 2 C++ programs: server and map (reduce is a simple summer).

> cd /home/kgimpel/grammar_induction

> hod allocate -d /home/kgimpel/grammar_induction -n 25

> ./dep_induction_server \

    input_file=/user/kgimpel/data/train20-20parts \

    aux_file=aux.train20 output_file=model.train20 \

    hod_config=/home/kgimpel/grammar_induction \

    num_reduce_tasks=5 1> stdout 2> stderr

dep_induction_server runs a MapReduce job on each iteration.

The input is split into pieces for the map tasks (the dataset is too small for the default Hadoop splitter); a sketch of such a splitter follows.
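Because the training set is small by Hadoop standards, the default input splitter would give almost all of it to a single map task, so the input is pre-split into a fixed number of pieces (the path name train20-20parts suggests a pre-split dataset). A minimal sketch of such a splitter, assuming one sentence per line; the round-robin scheme and output file names are illustrative, not the talk's actual tooling:

    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    // Split a one-sentence-per-line file into numParts round-robin pieces so that
    // each map task receives a comparable share of sentences.
    int main(int argc, char** argv) {
      if (argc != 3) {
        std::cerr << "usage: split_input <input_file> <num_parts>\n";
        return 1;
      }
      std::ifstream in(argv[1]);
      const int numParts = std::stoi(argv[2]);

      std::vector<std::ofstream> parts;
      for (int i = 0; i < numParts; ++i)
        parts.emplace_back(std::string(argv[1]) + ".part" + std::to_string(i));

      std::string line;
      long long n = 0;
      while (std::getline(in, line))
        parts[n++ % numParts] << line << '\n';

      std::cerr << "wrote " << n << " sentences to " << numParts << " parts\n";
      return 0;
    }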

Page 11

Outline

• The Problem

• Distributed Architecture

• Experiments and Hadoop Issues

Page 12

Speed-up with Hadoop

• 38,576 sentences
• ≤ 40 words / sent.
• 40 nodes
• 5 reduce tasks

• Average iteration time reduced from 2039 s to 115 s
• Total time reduced from 3400 minutes to 200 minutes
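Put differently, these numbers correspond to roughly a 17x speed-up on 40 nodes: 2039 s / 115 s ≈ 17.7 per iteration, and 3400 min / 200 min = 17 end to end.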

[Figure: log-likelihood (×10^6, roughly -2.2 to -1.8) vs. wall clock time (minutes, 0-3500), comparing a single node against Hadoop (40 nodes).]

Page 13

Hadoop Issues

1. Overhead of running a single MapReduce job

2. Stragglers in the map phase

Page 14

Typical Iteration (40 nodes, 38,576 sentences):

23:17:05 : map 0% reduce 0%

23:17:12 : map 3% reduce 0%

23:17:13 : map 26% reduce 0%

23:17:14 : map 49% reduce 0%

23:17:15 : map 66% reduce 0%

23:17:16 : map 72% reduce 0%

23:17:17 : map 97% reduce 0%

23:17:18 : map 100% reduce 0%

23:18:00 : map 100% reduce 1%

23:18:15 : map 100% reduce 2%

23:18:18 : map 100% reduce 4%

23:18:20 : map 100% reduce 15%

23:18:27 : map 100% reduce 17%

23:18:28 : map 100% reduce 18%

23:18:30 : map 100% reduce 23%

23:18:32 : map 100% reduce 100%

Consistent 40-second delay between the map and reduce phases.

• 115 s per iteration total
• 40+ s per iteration of overhead

• When we’re running 100 iterations per experiment, 40 seconds per iteration really adds up!

1/3 of execution time is overhead!
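As a quick check: 40 s of overhead out of roughly 115 s per iteration is about 35%, i.e. roughly one third of the execution time.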

Page 15

Typical Iteration (40 nodes, 38,576 sentences): same progress log as on Page 14.

• 5 reduce tasks used
• The reduce phase is simply aggregation of the values for 2600 parameters

Why does reduce take so long?

Page 16

Histogram of Iteration Times

[Figure: histogram of iteration times (seconds) vs. count. Mean = ~115 s.]

Page 17

Histogram of Iteration Times

[Figure: the same histogram of iteration times (seconds) vs. count, Mean = ~115 s, annotated "What's going on here?" next to the slowest iterations.]

Page 18

Typical Iteration: same progress log as on Page 14.

Page 19

Typical Iteration: same progress log as on Page 14.

Slow Iteration:

23:20:27 : map 0% reduce 0%
23:20:34 : map 5% reduce 0%
23:20:35 : map 20% reduce 0%
23:20:36 : map 41% reduce 0%
23:20:37 : map 56% reduce 0%
23:20:38 : map 74% reduce 0%
23:20:39 : map 95% reduce 0%
23:20:40 : map 97% reduce 0%
23:21:32 : map 97% reduce 1%
23:21:37 : map 97% reduce 2%
23:21:42 : map 97% reduce 12%
23:21:43 : map 97% reduce 15%
23:21:47 : map 97% reduce 19%
23:21:50 : map 97% reduce 21%
23:21:52 : map 97% reduce 26%
23:21:57 : map 97% reduce 31%
23:21:58 : map 97% reduce 32%
23:23:46 : map 100% reduce 32%
23:24:54 : map 100% reduce 46%
23:24:55 : map 100% reduce 86%
23:24:56 : map 100% reduce 100%

3 minutes waiting for the last map tasks to complete (the map phase sits at 97% from 23:20:40 until 23:23:46).

Page 20

(Typical and slow iteration progress logs repeated from the previous page: 3 minutes waiting for the last map tasks to complete.)

Suggestions? (Doesn’t Hadoop replicate map tasks to avoid this?)

Page 21

Questions?