Transcript of Distributed Iterative Training (Kevin Gimpel, Shay Cohen, Severin Hacker, Noah A. Smith)

Page 1

Distributed Iterative Training

Kevin Gimpel Shay Cohen Severin Hacker Noah A. Smith

Page 2

Outline

• The Problem

• Distributed Architecture

• Experiments and Hadoop Issues

Page 3

Iterative Training

• Many problems in NLP and machine learning require iterating over large training sets many times
  – Training log-linear models (logistic regression, conditional random fields)
  – Unsupervised or semi-supervised learning with EM (word alignment in MT, grammar induction)
  – Minimum Error-Rate Training in MT
  – *Online learning (MIRA, perceptron, stochastic gradient descent)

• All of the above except * can be easily parallelized (as sketched after this list):
  – Compute statistics on sections of the data independently
  – Aggregate them
  – Update parameters using the statistics of the full data set
  – Repeat until a stopping criterion is met
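To make the compute-aggregate-update loop concrete, here is a minimal single-machine sketch of the pattern in C++. All names (computeStats, updateParams, the toy shards) are hypothetical placeholders, not the talk's actual code; in the distributed setting described later, step 1 becomes the map phase and step 2 the reduce phase.

    #include <functional>
    #include <future>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    using Params = std::map<std::string, double>;
    using Stats  = std::map<std::string, double>;  // e.g. expected counts

    // Hypothetical: compute sufficient statistics on one shard of the data.
    Stats computeStats(const std::vector<std::string>& shard, const Params& params) {
      (void)params;                      // real code would run inference with the parameters
      Stats s;
      for (const auto& sentence : shard) {
        (void)sentence;
        s["p_root(NN)"] += 0.1;          // placeholder contribution
      }
      return s;
    }

    // Hypothetical: turn aggregated statistics into new parameter values
    // (renormalization for EM, or a gradient-based update for LBFGS/SGD).
    Params updateParams(const Params& old, const Stats& total) {
      Params updated = old;
      for (const auto& kv : total) updated[kv.first] = kv.second;  // placeholder update
      return updated;
    }

    int main() {
      std::vector<std::vector<std::string>> shards = {
          {"NNP NNP VBZ NNP"}, {"DT JJ NN MD VB JJ NNP CD"}};  // toy data shards
      Params params = {{"p_root(NN)", -1.9}};

      for (int iter = 0; iter < 100; ++iter) {  // fixed cap; real code checks convergence
        // 1. Compute statistics on each shard independently (the "map" step).
        std::vector<std::future<Stats>> futures;
        for (const auto& shard : shards)
          futures.push_back(std::async(std::launch::async, computeStats,
                                       std::cref(shard), std::cref(params)));
        // 2. Aggregate the per-shard statistics (the "reduce" step).
        Stats total;
        for (auto& f : futures)
          for (const auto& kv : f.get()) total[kv.first] += kv.second;
        // 3. Update parameters using the statistics of the full data set.
        params = updateParams(params, total);
      }
      std::cout << "p_root(NN) = " << params["p_root(NN)"] << "\n";
      return 0;
    }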

Page 4

Dependency Grammar Induction

• Given sentences of natural language text, infer (dependency) parse trees

• State-of-the-art results obtained using only a few thousand sentences of length ≤ 10 tokens (Smith and Eisner, 2006)

• This talk: scaling up to more and longer sentences using Hadoop!

Page 5

Dependency Grammar Induction

• Training
  – Input is a set of sentences (actually, POS tag sequences) and a grammar with initial parameter values
  – Run an iterative optimization algorithm (EM, LBFGS, etc.) that changes the parameter values on each iteration
  – Output is a learned set of parameter values

• Testing
  – Use the grammar with learned parameters to parse a small set of test sentences
  – Evaluate by computing the percentage of predicted edges that match a human annotator (a small sketch of this metric follows)
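The metric is attachment accuracy: the fraction of predicted dependency edges (equivalently, per-token head attachments) that agree with the human annotation. A minimal sketch, assuming a hypothetical representation of a parse as a vector of head indices (not the talk's actual data structures):

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // A parse of an n-token sentence is a vector of n head indices
    // (0 = root, i = the i-th token); this representation is an assumption.
    using Parse = std::vector<int>;

    // Fraction of predicted heads that match the gold heads, over all tokens.
    double attachmentAccuracy(const std::vector<Parse>& predicted,
                              const std::vector<Parse>& gold) {
      std::size_t correct = 0, total = 0;
      for (std::size_t s = 0; s < predicted.size(); ++s) {
        for (std::size_t i = 0; i < predicted[s].size(); ++i) {
          total += 1;
          if (predicted[s][i] == gold[s][i]) correct += 1;
        }
      }
      return total == 0 ? 0.0 : static_cast<double>(correct) / total;
    }

    int main() {
      // Toy example: one 4-token sentence, 3 of 4 heads predicted correctly.
      std::vector<Parse> predicted = {{2, 0, 2, 3}};
      std::vector<Parse> gold      = {{2, 0, 2, 2}};
      std::cout << attachmentAccuracy(predicted, gold) << "\n";  // prints 0.75
      return 0;
    }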

Page 6

Outline

• The Problem

• Distributed Architecture

• Experiments and Hadoop Issues

Page 7

MapReduce for Grammar Induction

• MapReduce was designed for:
  – Large amounts of data distributed across many disks
  – Simple data processing

• We have:
  – (Relatively) small amounts of data
  – Expensive processing and high memory requirements

Page 8

MapReduce for Grammar Induction

• Algorithms require 50-100 iterations for convergence
  – Each iteration requires a full sweep over all training data
  – The computational bottleneck is computing expected counts for EM (the gradient for LBFGS) on each iteration

• Our approach: run one MapReduce job for each iteration
  – Map: compute expected counts (gradient)
  – Reduce: aggregate (see the reducer sketch after this list)
  – Offline: renormalize (EM) or modify parameter values (LBFGS)

• Note: renormalization could be done in the reduce tasks for EM given the correct partition functions, but using LBFGS across multiple reduce tasks is trickier
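Because the map tasks emit expected counts as key/value pairs and the reduce step only aggregates them, the reducer can be the "simple summer" mentioned later (Page 10). A minimal Hadoop-streaming-style sketch in C++; the I/O format (tab-separated "parameter<TAB>count" lines on stdin, grouped by key) is an assumption based on streaming defaults, not the talk's actual program:

    #include <iostream>
    #include <sstream>
    #include <string>

    // Streaming-style reducer: sums the values for each key.
    // Assumes input lines of the form "key<TAB>value", grouped/sorted by key.
    int main() {
      std::string line, currentKey;
      double sum = 0.0;
      bool haveKey = false;

      while (std::getline(std::cin, line)) {
        std::istringstream in(line);
        std::string key;
        double value = 0.0;
        if (!std::getline(in, key, '\t') || !(in >> value)) continue;  // skip malformed lines

        if (haveKey && key != currentKey) {
          std::cout << currentKey << '\t' << sum << '\n';  // emit aggregated count
          sum = 0.0;
        }
        currentKey = key;
        haveKey = true;
        sum += value;
      }
      if (haveKey) std::cout << currentKey << '\t' << sum << '\n';
      return 0;
    }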

Page 9

MapReduce Implementation

[Diagram: data flow for one training iteration]

• Input sentences (POS tag sequences):
  [NNP,NNP,VBZ,NNP]
  [DT,JJ,NN,MD,VB,JJ,NNP,CD]
  [DT,NN,NN,VBZ,RB,VBN,VBN]
  …

• Map: compute expected counts, e.g.:
  p_root(NN) 0.345
  p_root(NN) 1.875
  p_dep(CD | NN, right) 0.175
  p_dep(CD | NN, right) 0.025
  p_dep(DT | NN, right) 0.065
  …

• Reduce: aggregate expected counts, e.g.:
  p_root(NN) 2.220
  p_dep(CD | NN, right) 0.200
  p_dep(DT | NN, right) 0.065
  …

• Server:
  1. Normalize expected counts to get new parameter values
  2. Start a new MapReduce job, placing the new parameter values on the distributed cache

• Distributed cache: new parameter values, e.g.:
  p_root(NN) = -1.91246
  p_dep(CD | NN, right) = -2.7175
  p_dep(DT | NN, right) = -3.0648
  …

(A sketch of the server's normalization step follows.)
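The server's step 1 is the EM M-step: within each multinomial distribution, divide each aggregated expected count by the distribution's total, storing the result in log space (consistent with the negative parameter values shown above). A minimal sketch, where the grouping function distributionOf() is a crude hypothetical placeholder and the toy counts are only illustrative; the real grouping depends on the grammar:

    #include <cmath>
    #include <iostream>
    #include <map>
    #include <string>

    // Aggregated expected counts keyed by parameter name, e.g. "p_dep(CD | NN, right)".
    using Counts = std::map<std::string, double>;

    // Hypothetical grouping function: returns the multinomial distribution
    // (conditioning context) a parameter belongs to.
    std::string distributionOf(const std::string& param) {
      auto bar = param.find('|');
      return bar == std::string::npos ? "root" : param.substr(bar);  // crude placeholder
    }

    // M-step: normalize counts within each distribution; return log-probabilities.
    std::map<std::string, double> normalize(const Counts& counts) {
      std::map<std::string, double> totals;
      for (const auto& kv : counts) totals[distributionOf(kv.first)] += kv.second;

      std::map<std::string, double> logProbs;
      for (const auto& kv : counts)
        logProbs[kv.first] = std::log(kv.second / totals[distributionOf(kv.first)]);
      return logProbs;
    }

    int main() {
      Counts counts = {{"p_root(NN)", 2.220},
                       {"p_dep(CD | NN, right)", 0.200},
                       {"p_dep(DT | NN, right)", 0.065}};  // toy aggregated counts
      for (const auto& kv : normalize(counts))
        std::cout << kv.first << " = " << kv.second << "\n";
      return 0;
    }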

Page 10

Running Experiments

We use streaming for all experiments, with 2 C++ programs: server and map (reduce is a simple summer).

> cd /home/kgimpel/grammar_induction

> hod allocate -d /home/kgimpel/grammar_induction -n 25

> ./dep_induction_server \

    input_file=/user/kgimpel/data/train20-20parts \

    aux_file=aux.train20 output_file=model.train20 \

    hod_config=/home/kgimpel/grammar_induction \

    num_reduce_tasks=5 1> stdout 2> stderr

dep_induction_server runs a MapReduce job on each iteration.

The input is split into pieces for the map tasks (the dataset is too small for the default Hadoop splitter); a sketch of such a splitter follows.
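Because the training set is small by Hadoop standards, the default input splitter would give almost all of it to a single map task, so the input is pre-split into a fixed number of pieces (the path name train20-20parts suggests a pre-split dataset). A minimal sketch of such a splitter, assuming one sentence per line; the round-robin scheme and output file names are illustrative, not the talk's actual tooling:

    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    // Split a one-sentence-per-line file into numParts round-robin pieces so that
    // each map task receives a comparable share of sentences.
    int main(int argc, char** argv) {
      if (argc != 3) {
        std::cerr << "usage: split_input <input_file> <num_parts>\n";
        return 1;
      }
      std::ifstream in(argv[1]);
      const int numParts = std::stoi(argv[2]);

      std::vector<std::ofstream> parts;
      for (int i = 0; i < numParts; ++i)
        parts.emplace_back(std::string(argv[1]) + ".part" + std::to_string(i));

      std::string line;
      long long n = 0;
      while (std::getline(in, line))
        parts[n++ % numParts] << line << '\n';

      std::cerr << "wrote " << n << " sentences to " << numParts << " parts\n";
      return 0;
    }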

Page 11

Outline

• The Problem

• Distributed Architecture

• Experiments and Hadoop Issues

Page 12

Speed-up with Hadoop

• 38,576 sentences
• ≤ 40 words / sent.
• 40 nodes
• 5 reduce tasks

• Average iteration time reduced from 2039 s to 115 s
• Total time reduced from 3400 minutes to 200 minutes
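Put differently, these numbers correspond to roughly a 17x speed-up on 40 nodes: 2039 s / 115 s ≈ 17.7 per iteration, and 3400 min / 200 min = 17 end to end.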

[Figure: log-likelihood (×10^6, roughly -2.2 to -1.8) vs. wall clock time (minutes, 0-3500), comparing a single node against Hadoop (40 nodes).]

Page 13

Hadoop Issues

1. Overhead of running a single MapReduce job

2. Stragglers in the map phase

Page 14

Typical Iteration (40 nodes, 38,576 sentences):

23:17:05 : map 0% reduce 0%

23:17:12 : map 3% reduce 0%

23:17:13 : map 26% reduce 0%

23:17:14 : map 49% reduce 0%

23:17:15 : map 66% reduce 0%

23:17:16 : map 72% reduce 0%

23:17:17 : map 97% reduce 0%

23:17:18 : map 100% reduce 0%

23:18:00 : map 100% reduce 1%

23:18:15 : map 100% reduce 2%

23:18:18 : map 100% reduce 4%

23:18:20 : map 100% reduce 15%

23:18:27 : map 100% reduce 17%

23:18:28 : map 100% reduce 18%

23:18:30 : map 100% reduce 23%

23:18:32 : map 100% reduce 100%

Consistent 40-second delay between the map and reduce phases.

• 115 s per iteration total
• 40+ s per iteration of overhead

• When we’re running 100 iterations per experiment, 40 seconds per iteration really adds up!

1/3 of execution time is overhead!
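As a quick check: 40 s of overhead out of roughly 115 s per iteration is about 35%, i.e. roughly one third of the execution time.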

Page 15

Typical Iteration (40 nodes, 38,576 sentences): same progress log as on Page 14.

• 5 reduce tasks used
• The reduce phase is simply aggregation of the values for 2600 parameters

Why does reduce take so long?

Page 16

Histogram of Iteration Times

[Figure: histogram of iteration times (seconds) vs. count. Mean = ~115 s.]

Page 17

Histogram of Iteration Times

[Figure: the same histogram of iteration times (seconds) vs. count, Mean = ~115 s, annotated "What's going on here?" next to the slowest iterations.]

Page 18

Typical Iteration: same progress log as on Page 14.

Page 19

Typical Iteration: same progress log as on Page 14.

Slow Iteration:

23:20:27 : map 0% reduce 0%
23:20:34 : map 5% reduce 0%
23:20:35 : map 20% reduce 0%
23:20:36 : map 41% reduce 0%
23:20:37 : map 56% reduce 0%
23:20:38 : map 74% reduce 0%
23:20:39 : map 95% reduce 0%
23:20:40 : map 97% reduce 0%
23:21:32 : map 97% reduce 1%
23:21:37 : map 97% reduce 2%
23:21:42 : map 97% reduce 12%
23:21:43 : map 97% reduce 15%
23:21:47 : map 97% reduce 19%
23:21:50 : map 97% reduce 21%
23:21:52 : map 97% reduce 26%
23:21:57 : map 97% reduce 31%
23:21:58 : map 97% reduce 32%
23:23:46 : map 100% reduce 32%
23:24:54 : map 100% reduce 46%
23:24:55 : map 100% reduce 86%
23:24:56 : map 100% reduce 100%

3 minutes waiting for the last map tasks to complete (the map phase sits at 97% from 23:20:40 until 23:23:46).

Page 20

(Typical and slow iteration progress logs repeated from the previous page: 3 minutes waiting for the last map tasks to complete.)

Suggestions? (Doesn’t Hadoop replicate map tasks to avoid this?)

Page 21

Questions?