SMT Model Building with MapReduce


Transcript of SMT Model Building with MapReduce

Page 1: SMT Model Building with MapReduce

SMT Model Building with MapReduce

Chris Dyer, Alex Mont, Aaron Cordova, Jimmy Lin

Page 2: SMT Model Building with MapReduce

A brief introduction

Statistical machine translation in an (idealized) nutshell:

We consider two decompositions:

• Word-based models (used for word alignment)

• Phrase-based models (used for translation)

Page 3: SMT Model Building with MapReduce

A brief introduction

Statistical machine translation in an (idealized) nutshell:

We consider two decompositions:

• Word-based models (used for word alignment)

• Phrase-based models (used for translation)

How do we estimate the parameters efficiently?

Page 4: SMT Model Building with MapReduce

Outline

• MapReduce

• SMT & the SMT pipeline

• MapReducing relative frequencies

• Experimental results

• Future directions

Page 5: SMT Model Building with MapReduce

MapReduce

[Figure: MapReduce dataflow. Input splits feed parallel map tasks; a barrier groups intermediate values by key; reduce tasks produce the outputs.]

Figure 2: Illustration of the MapReduce framework: the “mapper” is applied to all input records, which generates results that are aggregated by the “reducer”.

Key/value pairs form the basic data structure in MapReduce. The “mapper” is applied to every input key/value pair to generate an arbitrary number of intermediate key/value pairs. The “reducer” is applied to all values associated with the same intermediate key to generate output key/value pairs. This two-stage processing structure is illustrated in Figure 2.

Under this framework, a programmer need only provide implementations of map and reduce. On top of a distributed file system (Ghemawat et al., 2003), the runtime transparently handles all other aspects of execution, on clusters ranging from a few to a few thousand workers, on commodity hardware assumed to be unreliable, and thus is tolerant to various faults through a number of error recovery mechanisms. The runtime also manages data exchange, including splitting the input across multiple map workers and the potentially very large sorting problem between the map and reduce phases whereby intermediate key/value pairs must be grouped by key.

For the MapReduce experiments reported in this paper, we used Hadoop version 0.16.0,3 which is an open-source Java implementation of MapReduce, running on a 20-machine cluster (1 master, 19 slaves). Each machine has two processors (running at either 2.4GHz or 2.8GHz), 4GB memory (map and reduce tasks were limited to 768MB), and 100GB disk. All software was implemented in Java.

3 http://hadoop.apache.org/

Method 1
  Map1:    〈A, B〉 → 〈〈A, B〉, 1〉
  Reduce1: 〈〈A, B〉, c(A, B)〉
  Map2:    〈〈A, B〉, c(A, B)〉 → 〈〈A, ∗〉, c(A, B)〉
  Reduce2: 〈〈A, ∗〉, c(A)〉
  Map3:    〈〈A, B〉, c(A, B)〉 → 〈A, 〈B, c(A, B)〉〉
  Reduce3: 〈A, 〈B, c(A, B)/c(A)〉〉

Method 2
  Map1:    〈A, B〉 → 〈〈A, B〉, 1〉; 〈〈A, ∗〉, 1〉
  Reduce1: 〈〈A, B〉, c(A, B)/c(A)〉

Method 3
  Map1:    〈A, Bi〉 → 〈A, 〈Bi : 1〉〉
  Reduce1: 〈A, 〈B1 : c(A, B1)/c(A)〉, 〈B2 : c(A, B2)/c(A)〉, …〉

Table 1: Three methods for computing PMLE(B|A). The first element in each tuple is a key and the second element is the associated value produced by the mappers and reducers.

3 Maximum Likelihood Estimates

The two classes of models under consideration are parameterized with conditional probability distributions over discrete events, generally estimated according to the maximum likelihood criterion:

PMLE(B|A) = c(A, B) / c(A) = c(A, B) / ∑_B′ c(A, B′)    (1)
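For example, with hypothetical counts c(A, B1) = 2 and c(A, B2) = 3, the marginal is c(A) = 5, so Equation 1 gives PMLE(B1|A) = 2/5 = 0.4 and PMLE(B2|A) = 3/5 = 0.6.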

Since this calculation is fundamental to both approaches (they distinguish themselves only by where the counts of the joint events come from: in the case of the phrase model, they are observed directly, and in the case of the word-alignment models they are the number of expected events in a partially hidden process given an existing model of that process), we begin with an overview of how to compute conditional probabilities in MapReduce.

We consider three possible solutions to this problem, shown in Table 1. Method 1 computes the count for each pair 〈A, B〉, computes the marginal c(A), and then groups all the values for a given A together, such that the marginal is guaranteed to be first and then the pair counts follow. This enables Reduce3 to only hold the marginal value in memory as it processes the remaining values. Method 2 works similarly, except that the original mapper emits two values for each pair 〈A, B〉 that is encountered: one that will be the marginal and one that contributes to the pair count.


2 Programming Model

The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce.

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.

The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user’s reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.

2.1 Example

Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudo-code:

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

The map function emits each word plus an associated count of occurrences (just ‘1’ in this simple example). The reduce function sums together all counts emitted for a particular word.

In addition, the user writes code to fill in a mapreduce specification object with the names of the input and output files, and optional tuning parameters. The user then invokes the MapReduce function, passing it the specification object. The user’s code is linked together with the MapReduce library (implemented in C++). Appendix A contains the full program text for this example.

2.2 Types

Even though the previous pseudo-code is written in terms of string inputs and outputs, conceptually the map and reduce functions supplied by the user have associated types:

map    (k1, v1)       → list(k2, v2)
reduce (k2, list(v2)) → list(v2)

I.e., the input keys and values are drawn from a different domain than the output keys and values. Furthermore, the intermediate keys and values are from the same domain as the output keys and values.

Our C++ implementation passes strings to and from the user-defined functions and leaves it to the user code to convert between strings and appropriate types.
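As a rough sketch of these signatures in Java (the language used for the experiments in this talk), the two user-supplied functions might be declared as generic interfaces like the following; MapFn and ReduceFn are hypothetical names, not part of Hadoop or the Google library:

import java.util.List;
import java.util.Map;

// Hypothetical generic interfaces mirroring map (k1, v1) -> list(k2, v2)
// and reduce (k2, list(v2)) -> list(v2).
interface MapFn<K1, V1, K2, V2> {
    List<Map.Entry<K2, V2>> map(K1 key, V1 value);
}

interface ReduceFn<K2, V2> {
    List<V2> reduce(K2 key, List<V2> values);
}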

2.3 More Examples

Here are a few simple examples of interesting programs that can be easily expressed as MapReduce computations.

Distributed Grep: The map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.

Count of URL Access Frequency: The map function processes logs of web page requests and outputs 〈URL, 1〉. The reduce function adds together all values for the same URL and emits a 〈URL, total count〉 pair.

Reverse Web-Link Graph: The map function outputs 〈target, source〉 pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair 〈target, list(source)〉.

Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of 〈word, frequency〉 pairs. The map function emits a 〈hostname, term vector〉 pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final 〈hostname, term vector〉 pair.
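As one concrete illustration, here is a minimal single-process Java sketch of the reverse web-link graph computation; the toy pages and the extractLinks helper are hypothetical, and a real MapReduce job would distribute the map and reduce calls rather than run them in one loop:

import java.util.*;

class ReverseLinkGraph {
    // Hypothetical link extractor: here a page body is just a space-separated link list.
    static List<String> extractLinks(String page) {
        return Arrays.asList(page.split("\\s+"));
    }

    public static void main(String[] args) {
        Map<String, String> pages = Map.of(
            "a.com", "b.com c.com",
            "b.com", "c.com");

        // map: for each link found in page "source", emit <target, source>
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> page : pages.entrySet())
            for (String target : extractLinks(page.getValue()))
                grouped.computeIfAbsent(target, t -> new ArrayList<>()).add(page.getKey());

        // reduce: emit <target, list(source)>
        grouped.forEach((target, sources) -> System.out.println(target + " -> " + sources));
    }
}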


User supplies these functions:

Page 6: SMT Model Building with MapReduce

MapReduce: example

Count the words: the “Hello, world!” of MapReduce

Map(input):
  for each w in input:
    emit <w, 1>

Page 7: SMT Model Building with MapReduce

MapReduce: example

Count the words: the “Hello, world!” of MapReduce

Map(input):
  for each w in input:
    emit <w, 1>

Reduce(key, values):
  sum = 0
  for each val in values:
    sum += val
  emit(key, sum)
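A minimal single-machine Java sketch of the same word count, assuming a couple of toy input strings; the grouping map stands in for the barrier that groups intermediate values by key in a real MapReduce run:

import java.util.*;

class WordCount {
    public static void main(String[] args) {
        List<String> inputs = List.of("hello world", "hello mapreduce world");

        // Map: for each word w in the input, emit <w, 1>
        Map<String, List<Integer>> grouped = new TreeMap<>();  // barrier: group values by key
        for (String input : inputs)
            for (String w : input.split("\\s+"))
                grouped.computeIfAbsent(w, k -> new ArrayList<>()).add(1);

        // Reduce: sum the values associated with each key
        grouped.forEach((word, ones) -> {
            int sum = 0;
            for (int one : ones) sum += one;
            System.out.println(word + "\t" + sum);
        });
    }
}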

Page 8: SMT Model Building with MapReduce

MapReduce

• Benefits

• Highly scalable

• Fault tolerant

• Hides details of concurrency from the user

• Runs on commodity hardware

• Stores massive logical files across many small disks!

Page 9: SMT Model Building with MapReduce

The Phrase-Based SMT Pipeline

parallel text

Page 10: SMT Model Building with MapReduce

The Phrase-Based SMT Pipeline

1. alignment modeling

parallel text → word alignment

Page 11: SMT Model Building with MapReduce

The Phrase-Based SMT Pipeline

1. alignment modeling

parallel text → word alignment → phrase table

2. phrase extraction and scoring

Page 12: SMT Model Building with MapReduce

The Phrase-Based SMT Pipeline

1. alignment modeling

parallel text → word alignment → phrase table

2. phrase extraction and scoring

decoder

Page 13: SMT Model Building with MapReduce

The Phrase-Based SMT Pipeline

1. alignment modeling

parallel text → word alignment → phrase table

2. phrase extraction and scoring

decoder

language model

Page 14: SMT Model Building with MapReduce

The Phrase-Based SMT Pipeline

1. alignment modeling

parallel text → word alignment → phrase table

2. phrase extraction and scoring

decoder

η συσκευή μου δεν λειτουργεί ...

language model

my machine is not working ...

Page 15: SMT Model Building with MapReduce

The Phrase-Based SMT Pipeline

1. alignment modeling

parallel text → word alignment → phrase table

2. phrase extraction and scoring

decoder

η συσκευή μου δεν λειτουργεί ...

language model

my machine is not working ...

1.2s / sent

Page 16: SMT Model Building with MapReduce

The Phrase-Based SMT Pipeline

1. alignment modeling

parallel text → word alignment → phrase table

2. phrase extraction and scoring

decoder

η συσκευή μου δεν λειτουργεί ...

language model

my machine is not working ...

26h17m 48h06m

1.2s / sent


Page 18: SMT Model Building with MapReduce

Is less data the answer?

[Figure: two curves plotted against corpus size (sentences), from 10,000 to 10,000,000: training time (seconds, with ticks from 15 min to 2 days) and translation quality (BLEU, 0.3 to 0.6).]

Timing experiments conducted on Arabic-English training corpora publicly available from the LDC. Test set is the NIST MT03 evaluation set.

Page 19: SMT Model Building with MapReduce

Is less data the answer?


Probably not...

Timing experiments conducted on Arabic-English training corpora publicly available from the LDC. Test set is the NIST MT03 evaluation set.

Page 20: SMT Model Building with MapReduce

Building a phrase table

The reducer groups all pairs together by the A value, processes the marginal first, and, like Method 1, must only keep this value in memory as it processes the remaining pair counts. Method 2 requires more data to be processed by the MapReduce framework, but only requires a single sort operation (i.e., fewer MapReduce iterations).

Method 3 works slightly differently: rather than computing the pair counts independently of each other, the counts of all the B events jointly occurring with a particular A = a event are stored in an associative data structure in memory in the reducer. The marginal c(A) can be computed by summing over all the values in the associative data structure, and then a second pass normalizes. This requires that the conditional distribution P(B|A = a) not have so many parameters that it cannot be represented in memory. A potential advantage of this approach is that the MapReduce framework can use a “combiner” to group many 〈A, B〉 pairs into a single value before the key/value pair leaves for the reducer.4 If the underlying distribution from which pairs 〈A, B〉 are drawn has certain characteristics, this can result in a significant reduction in the number of keys that the mapper emits (although the number of statistics will be identical). And since all keys must be sorted prior to the reducer step beginning, reducing the number of keys can have a significant performance impact.
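A minimal Java sketch of the Method 3 reducer logic described above (an illustration under assumed string keys and integer counts, not the authors' implementation):

import java.util.*;

class Method3Reducer {
    // Input: all <B : count> pairs emitted for a single A; output: P(B | A).
    static Map<String, Double> reduce(String a, List<Map.Entry<String, Integer>> values) {
        Map<String, Integer> counts = new HashMap<>();           // c(A, B), kept in memory
        for (Map.Entry<String, Integer> e : values)
            counts.merge(e.getKey(), e.getValue(), Integer::sum);

        double marginal = 0;                                      // c(A) = sum over B of c(A, B)
        for (int c : counts.values()) marginal += c;

        Map<String, Double> probs = new HashMap<>();              // second pass: normalize
        for (Map.Entry<String, Integer> e : counts.entrySet())
            probs.put(e.getKey(), e.getValue() / marginal);
        return probs;
    }
}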

The graph in Figure 3 shows the performance of the three problem decompositions on the two model types we are estimating: conditional phrase translation probabilities (1.5M sentences, max phrase length = 7), and conditional lexical translation probabilities as found in a word alignment model (500k sentences). In both cases, Method 3, which makes use of more memory to store counts of all B events associated with event A = a, completes at least 50% more quickly. This efficiency is due to the Zipfian distribution of both phrases and lexical items in our corpora: a few frequent items account for a large portion of the corpus. The memory requirements were also observed to be quite reasonable for the models in question.

4 Combiners operate like reducers, except they run directly on the output of a mapper before the results leave memory. They can be used when the reduction operation is associative and commutative. For more information refer to Dean and Ghemawat (2004).

[Figure: bar chart of time (seconds, 0–1600) by estimation method (Method 1, Method 2, Method 3), with separate bars for phrase pairs and word pairs.]

Figure 3: PMLE computation strategies.

Figure 4: A word-aligned sentence. Examples of consistent phrase pairs include 〈vi, i saw〉, 〈la mesa pequeña, the small table〉, and 〈mesa pequeña, small table〉; but note that, for example, it is not possible to extract a consistent phrase corresponding to the foreign string la mesa or the English string the small.

Representing P(B|A = a) in the phrase model required at most 90k parameters, and in the lexical model, 128k parameters (i.e., the size of the vocabulary for language B). For the remainder of the experiments reported, we confine ourselves to the use of Method 3.

4 Phrase-Based Translation

In phrase-based translation, the translation process is modeled by splitting the source sentence into phrases (a contiguous string of words) and translating the phrases as a unit (Och et al., 1999; Koehn et al., 2003). Phrases are extracted from a word-aligned parallel sentence according to the strategy proposed by Och et al. (1999), where every word in a phrase is aligned only to other words in the phrase, and not to any words outside the phrase bounds. Figure 4 shows an example aligned sentence and some of the consistent subphrases that may be extracted.

Step 1: extract phrases from a word aligned parallel text
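As an illustration of the consistency criterion behind Step 1, here is a small Java sketch; the helper name and the hard-coded alignment (pairs of source index, target index for the sentence of Figure 4) are assumptions made for the example:

import java.util.*;

class PhraseConsistency {
    // A source span [f1,f2] and target span [e1,e2] are consistent if every alignment
    // point touching either span lies entirely inside the pair, and at least one does.
    static boolean consistent(int f1, int f2, int e1, int e2, int[][] alignment) {
        boolean hasPoint = false;
        for (int[] a : alignment) {                 // a[0] = source index, a[1] = target index
            boolean fIn = a[0] >= f1 && a[0] <= f2;
            boolean eIn = a[1] >= e1 && a[1] <= e2;
            if (fIn != eIn) return false;           // links a word inside to a word outside
            if (fIn) hasPoint = true;
        }
        return hasPoint;
    }

    public static void main(String[] args) {
        // "vi la mesa pequeña" / "i saw the small table":
        // vi-{i, saw}, la-the, mesa-table, pequeña-small
        int[][] alignment = {{0, 0}, {0, 1}, {1, 2}, {2, 4}, {3, 3}};
        System.out.println(consistent(0, 0, 0, 1, alignment));  // <vi, i saw> -> true
        System.out.println(consistent(1, 2, 2, 3, alignment));  // <la mesa, the small> -> false
    }
}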

Page 21: SMT Model Building with MapReduce

Building a phrase table


<i saw, vi>

Step 1: extract phrases from a word aligned parallel text

Page 22: SMT Model Building with MapReduce

Building a phrase table


<i saw, vi> <the, la>

Step 1: extract phrases from a word aligned parallel text

Page 23: SMT Model Building with MapReduce

Building a phrase table


<i saw, vi> <the, la> <small, pequeña>

Step 1: extract phrases from a word aligned parallel text

Page 24: SMT Model Building with MapReduce

Building a phrase table


<i saw, vi> <the, la> <small, pequeña> <table, mesa>

Step 1: extract phrases from a word aligned parallel text

Page 25: SMT Model Building with MapReduce

Building a phrase table


<i saw, vi> <the, la> <small, pequeña> <table, mesa> <i saw the, vi la>

Step 1: extract phrases from a word aligned parallel text

Page 26: SMT Model Building with MapReduce

Building a phrase table


<i saw, vi> <the, la> <small, pequeña> <table, mesa> <i saw the, vi la> <small table, mesa pequeña>

Step 1: extract phrases from a word aligned parallel text

Page 27: SMT Model Building with MapReduce

Building a phrase table


<i saw, vi> <the, la> <small, pequeña> <table, mesa> <i saw the, vi la> <small table, mesa pequeña> <the small table, la mesa pequeña>

Step 1: extract phrases from a word aligned parallel text

Page 28: SMT Model Building with MapReduce

Building a phrase table


<i saw, vi> <the, la> <small, pequeña> <table, mesa> <i saw the, vi la> <small table, mesa pequeña> <the small table, la mesa pequeña> <i saw the small table, vi la mesa pequeña>

Step 1: extract phrases from a word aligned parallel text

Page 29: SMT Model Building with MapReduce

Building a phrase table

<i saw, vi> <the, la> <small, pequeña> <table, mesa> <i saw the, vi la> <small table, mesa pequeña> <the small table, la mesa pequeña> <i saw the small table, vi la mesa pequeña> ...

Step 2: compute joint counts

<i saw, vi> 15
<i saw the, vi la> 5
<small, pequeña> 72
<the, la> 5434
<the, el> 6218
<table, mesa> 2
...

Page 30: SMT Model Building with MapReduce

Building a phrase table

Step 3: compute marginal counts

<i saw, vi> 15
<i saw the, vi la> 5
<small, pequeña> 72
<the, la> 5434
<the, el> 6218
<table, mesa> 2
...

<i saw, *> 15
<i saw the, *> 5
<small, *> 72
<the, *> 11652
<table, *> 2
...

Page 31: SMT Model Building with MapReduce

Building a phrase table

Step 4: join and normalize

<i saw, vi> 15
<i saw the, vi la> 5
<small, pequeña> 72
<the, la> 5434
<the, el> 6218
<table, mesa> 2
...

<i saw, *> 15
<i saw the, *> 5
<small, *> 72
<the, *> 11652
<table, *> 2
...

<i saw, vi> 1.0
<i saw the, vi la> 1.0
<small, pequeña> 1.0
<the, la> 0.47
<the, el> 0.53
<table, mesa> 1.0
...
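A minimal Java sketch of this join-and-normalize step, using a few of the counts shown above (a single-process illustration; in the actual pipeline the normalization happens inside the reducer):

import java.util.*;

class PhraseTableScore {
    public static void main(String[] args) {
        // joint counts c(f, e), keyed by the source phrase f
        Map<String, Map<String, Integer>> joint = Map.of(
            "the", Map.of("la", 5434, "el", 6218),
            "i saw", Map.of("vi", 15));

        joint.forEach((f, translations) -> {
            // marginal <f, *> is the sum of the joint counts for f
            int marginal = translations.values().stream().mapToInt(Integer::intValue).sum();
            // relative frequency f(e|f) = c(f, e) / c(f, *)
            translations.forEach((e, c) ->
                System.out.printf("<%s, %s> %.2f%n", f, e, (double) c / marginal));
        });
    }
}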

Page 32: SMT Model Building with MapReduce

MapReduce

• Phrase translation probabilities are just relative frequencies f(e|f)

• Relative frequencies can be estimated using MapReduce.

• Why MapReduce?

• Easy parallelization across many machines

• No expensive infrastructure required

Page 33: SMT Model Building with MapReduce

Computing Relative Frequencies


Page 34: SMT Model Building with MapReduce

Method 1

Corpus: <a,1> <a,2> <b,1> <a,2> <a,3> <b,4> <a,2> <b,1> <a,2> <b,1> <a,1>

Joint count table (rows: first element a/b; columns: second element 1–4, plus the marginal Σ):

        1   2   3   4   Σ
    a
    b

Mapper counts joint events.


Page 46: SMT Model Building with MapReduce

Method 1

Corpus: <a,1> <a,2> <b,1> <a,2> <a,3> <b,4> <a,2> <b,1> <a,2> <b,1> <a,1>

      1    2    3    4    Σ
 a:   2
 b:

Reducer computes counts.

Page 47: SMT Model Building with MapReduce

Method 1

Corpus: <a,1> <a,2> <b,1> <a,2> <a,3> <b,4> <a,2> <b,1> <a,2> <b,1> <a,1>

      1    2    3    4    Σ
 a:   2    4
 b:

Reducer computes counts.

Page 48: SMT Model Building with MapReduce

Method 1

Corpus: <a,1> <a,2> <b,1> <a,2> <a,3> <b,4> <a,2> <b,1> <a,2> <b,1> <a,1>

      1    2    3    4    Σ
 a:   2    4    1
 b:

Reducer computes counts.

Page 49: SMT Model Building with MapReduce

Method 1

Corpus: <a,1> <a,2> <b,1> <a,2> <a,3> <b,4> <a,2> <b,1> <a,2> <b,1> <a,1>

      1    2    3    4    Σ
 a:   2    4    1
 b:   3

Reducer computes counts.

Page 50: SMT Model Building with MapReduce

Method 1

Corpus: <a,1> <a,2> <b,1> <a,2> <a,3> <b,4> <a,2> <b,1> <a,2> <b,1> <a,1>

      1    2    3    4    Σ
 a:   2    4    1
 b:   3              1

Reducer computes counts.

Page 51: SMT Model Building with MapReduce

Method 1

Corpus: <a,1> <a,2> <b,1> <a,2> <a,3> <b,4> <a,2> <b,1> <a,2> <b,1> <a,1>

      1    2    3    4    Σ
 a:   2    4    1
 b:   3              1

A second reducer computes marginals.

Page 52: SMT Model Building with MapReduce

Method 1

Corpus: <a,1> <a,2> <b,1> <a,2> <a,3> <b,4> <a,2> <b,1> <a,2> <b,1> <a,1>

      1    2    3    4    Σ
 a:   2    4    1          7
 b:   3              1

A second reducer computes marginals.

Page 53: SMT Model Building with MapReduce

Methods 1/2

Corpus: <a,1> <a,2> <b,1> <a,2> <a,3> <b,4> <a,2> <b,1> <a,2> <b,1> <a,1>

      1    2    3    4    Σ
 a:   2    4    1          7
 b:   3              1     4

A second reducer computes marginals.

Page 54: SMT Model Building with MapReduce

Methods 1/2

Corpus: <a,1> <a,2> <b,1> <a,2> <a,3> <b,4> <a,2> <b,1> <a,2> <b,1> <a,1>

      1    2    3    4    Σ
 a:   2    4    1          7
 b:   3              1     4

A second reducer computes marginals.

Alternative (Method 2): the mappers also emit a marginal count for each event, and a single reducer computes both the marginal and the normalized probabilities.

Page 55: SMT Model Building with MapReduce

Methods 1/2

Corpus: <a,1> <a,2> <b,1> <a,2> <a,3> <b,4> <a,2> <b,1> <a,2> <b,1> <a,1>

      1    2    3    4    Σ
 a:  0.3   4    1          7
 b:   3              1     4

Reducer can sort marginal before all other sums and normalize, one cell at a time.

Page 56: SMT Model Building with MapReduce

Methods 1/2

Corpus: <a,1> <a,2> <b,1> <a,2> <a,3> <b,4> <a,2> <b,1> <a,2> <b,1> <a,1>

      1    2    3    4    Σ
 a:  0.3  0.6   1          7
 b:   3              1     4

Reducer can sort marginal before all other sums and normalize, one cell at a time.

Page 57: SMT Model Building with MapReduce

Methods 1/2

Corpus: <a,1> <a,2> <b,1> <a,2> <a,3> <b,4> <a,2> <b,1> <a,2> <b,1> <a,1>

      1    2    3    4    Σ
 a:  0.3  0.6  0.1         7
 b:   3              1     4

Reducer can sort marginal before all other sums and normalize, one cell at a time.

Page 58: SMT Model Building with MapReduce

Methods 1/2

Corpus: <a,1> <a,2> <b,1> <a,2> <a,3> <b,4> <a,2> <b,1> <a,2> <b,1> <a,1>

      1    2    3    4    Σ
 a:  0.3  0.6  0.1         7
 b:  0.8             1     4

Reducer can sort marginal before all other sums and normalize, one cell at a time.

Page 59: SMT Model Building with MapReduce

Methods 1/2

Corpus: <a,1> <a,2> <b,1> <a,2> <a,3> <b,4> <a,2> <b,1> <a,2> <b,1> <a,1>

      1    2    3    4    Σ
 a:  0.3  0.6  0.1         7
 b:  0.8            0.2    4

Reducer can sort marginal before all other sums and normalize, one cell at a time.

Page 60: SMT Model Building with MapReduce

Methods 1/2

Corpus: <a,1> <a,2> <b,1> <a,2> <a,3> <b,4> <a,2> <b,1> <a,2> <b,1> <a,1>

      1    2    3    4    Σ
 a:  0.3  0.6  0.1         7
 b:  0.8            0.2    4

Page 61: SMT Model Building with MapReduce

Methods 1/2

Corpus: <a,1> <a,2> <b,1> <a,2> <a,3> <b,4> <a,2> <b,1> <a,2> <b,1> <a,1>

      1    2    3    4    Σ
 a:  0.3  0.6  0.1         7
 b:  0.8            0.2    4

The join is a very large, expensive sort. Can we do better?

Page 62: SMT Model Building with MapReduce

Methods 1/2

Corpus: <a,1> <a,2> <b,1> <a,2> <a,3> <b,4> <a,2> <b,1> <a,2> <b,1> <a,1>

      1    2    3    4    Σ
 a:  0.3  0.6  0.1         7
 b:  0.8            0.2    4

The join is a very large, expensive sort. Can we do better?

Yes - if the CPDs we are estimating have few parameters...

Page 63: SMT Model Building with MapReduce

Method 3

Corpus: <a,1> <a,2> <b,1> <a,2> <a,3> <b,4> <a,2> <b,1> <a,2> <b,1> <a,1>

      1    2    3    4    Σ
 a:
 b:

Page 64: SMT Model Building with MapReduce

Method 3

Corpus: <a,1> <a,2> <b,1> <a,2> <a,3> <b,4> <a,2> <b,1> <a,2> <b,1> <a,1>

      1    2    3    4    Σ
 a:
 b:

If memory allows, each reducer job counts, marginalizes, and normalizes.

Page 65: SMT Model Building with MapReduce

Method 3

Corpus: <a,1> <a,2> <b,1> <a,2> <a,3> <b,4> <a,2> <b,1> <a,2> <b,1> <a,1>

      1    2    3    4    Σ
 a:  0.3  0.6  0.1         7
 b:

If memory allows, one reducer counts, marginalizes, and normalizes.

Page 66: SMT Model Building with MapReduce

Method 3

Corpus: <a,1> <a,2> <b,1> <a,2> <a,3> <b,4> <a,2> <b,1> <a,2> <b,1> <a,1>

      1    2    3    4    Σ
 a:  0.3  0.6  0.1         7
 b:  0.8            0.2    4

If memory allows, one reducer counts, marginalizes, and normalizes.

Page 67: SMT Model Building with MapReduce

Method 3

Corpus: <a,1> <a,2> <b,1> <a,2> <a,3> <b,4> <a,2> <b,1> <a,2> <b,1> <a,1>

      1    2    3    4    Σ
 a:  0.3  0.6  0.1         7
 b:  0.8            0.2    4

Rather than sorting keys drawn from V1 × V2, we just sort each item into one of the bins given by V2.
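As a concrete illustration of Method 3, here is a minimal in-memory sketch in Python (the authors' system was Java on Hadoop; the function names and the simulated shuffle are illustrative assumptions, not their code). Mapper output is keyed only by the conditioning event, a combiner-style step pre-aggregates the {B: count} dictionaries before the shuffle, and the reducer holds the whole distribution P(B|A = a) in an associative array while it marginalizes and normalizes.

```python
from collections import Counter, defaultdict


def map_method3(pairs):
    """Emit, for every observed pair, the conditioning event as key and {B: 1} as value."""
    for a, b in pairs:
        yield a, Counter({b: 1})


def combine(emitted):
    """Combiner: merge the {B: count} dictionaries produced by one mapper before the shuffle."""
    partial = defaultdict(Counter)
    for a, counts in emitted:
        partial[a].update(counts)
    return partial.items()


def reduce_method3(a, list_of_counts):
    """Sum all {B: count} dictionaries for A = a, then normalize by the marginal c(a)."""
    totals = Counter()
    for counts in list_of_counts:
        totals.update(counts)
    marginal = sum(totals.values())                      # c(a)
    return {b: c / marginal for b, c in totals.items()}  # P(B | A = a)


corpus = [("a", 1), ("a", 2), ("b", 1), ("a", 2), ("a", 3), ("b", 4),
          ("a", 2), ("b", 1), ("a", 2), ("b", 1), ("a", 1)]

# Simulate the shuffle: group combined values by key, then reduce each key independently.
shuffled = defaultdict(list)
for a, counts in combine(map_method3(corpus)):
    shuffled[a].append(counts)
for a, values in sorted(shuffled.items()):
    print(a, reduce_method3(a, values))
# a {1: 0.2857..., 2: 0.5714..., 3: 0.1428...}
# b {1: 0.75, 4: 0.25}
```

Only the conditioning events (here a and b) ever appear as keys, so the sort runs over a much smaller key space than in Methods 1 and 2, and the combiner further reduces the number of key/value pairs that have to be shuffled.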

Page 68: SMT Model Building with MapReduce

Computing Relative Frequency

Phrase pairs / Word pairs

Page 69: SMT Model Building with MapReduce

Computing Relative Frequency

[Bar chart: estimation time in seconds (0 to 1,500) for Method 1, Method 2, and Method 3, shown for phrase pairs and word pairs; see Figure 3.]


Page 73: SMT Model Building with MapReduce

MapReduce Phrase-table building

[Plot: training time (log scale, 1.5 min to 2 days) versus corpus size in sentences (10,000 to 10,000,000); series: Moses training time, MapReduce training (38 M/R), Optimal (Moses/38).]

Figure 5: Phrase model extraction and scoring times at various corpus sizes.

Constructing a model involves extracting all the phrase pairs 〈e, f〉 and computing the conditional phrase translation probabilities in both directions.5

With a minor adjustment to the techniques introduced in Section 3, it is possible to estimate P(B|A) and P(A|B) concurrently.
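One way such a concurrent estimate could be organized is sketched below in Python (illustrative only; the direction tags "E2F"/"F2E", the function names, and the toy phrase pairs are assumptions, not the authors' code). Each extracted phrase pair is emitted under both conditioning directions, so a single Method 3 style reduce pass produces both conditional tables.

```python
from collections import Counter, defaultdict


def map_phrase_pairs(extracted_pairs):
    """Emit each extracted phrase pair under both conditioning directions."""
    for e, f in extracted_pairs:
        yield ("E2F", e), Counter({f: 1})   # will become P(f | e)
        yield ("F2E", f), Counter({e: 1})   # will become P(e | f)


def reduce_normalize(values):
    """Sum the partial counts for one conditioning phrase and normalize."""
    totals = Counter()
    for counts in values:
        totals.update(counts)
    marginal = sum(totals.values())
    return {k: v / marginal for k, v in totals.items()}


pairs = [("the house", "la maison"), ("the house", "la maison"),
         ("the blue house", "la maison bleue"), ("the flower", "la fleur")]

shuffled = defaultdict(list)
for key, counts in map_phrase_pairs(pairs):
    shuffled[key].append(counts)

tables = {key: reduce_normalize(values) for key, values in shuffled.items()}
print(tables[("E2F", "the house")])   # {'la maison': 1.0}
print(tables[("F2E", "la maison")])   # {'the house': 1.0}
```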

Figure 5 shows the time it takes to construct a phrase-based translation model using the Moses tool, running on a single core, as well as the time it takes to build the same model using our MapReduce implementation. For reference, on the same graph we plot a hypothetical, optimally-parallelized version of Moses, which would run in 1/38 of the time required for the single-core version on our cluster.6

Although these represent completely different implementations, this comparison offers a sense of MapReduce's benefits. The framework provides a conceptually simple solution to the problem, while providing an implementation that is both scalable and fault tolerant; in fact, transparently so, since the runtime hides all these complexities from the researcher. From the graph it is clear that the overhead associated with the framework itself is quite low, especially for large quantities of data. We concede that it may be possible for a custom solution (e.g., with MPI) to achieve even faster running times, but we argue that devoting resources to developing such a solution would not be cost-effective.

Next, we explore a class of models where the standard tools work primarily in memory, but where the computational complexity of the models is greater.

5 Following Och and Ney (2002), it is customary to combine both these probabilities as feature values in a log-linear model.
6 In our cluster, only 19 machines actually compute, and each has two single-core processors.

5 Word Alignment

Although word-based translation models have been largely supplanted by models that make use of larger translation units, the task of generating a word alignment, the mapping between the words in the source and target sentences that are translationally equivalent, remains crucial to nearly all approaches to statistical machine translation.

The IBM models, together with a Hidden Markov Model (HMM), form a class of generative models that are based on a lexical translation model P(f_j|e_i), where each word f_j in the foreign sentence f_1^m is generated by precisely one word e_i in the sentence e_1^l, independently of the other translation decisions (Brown et al., 1993; Vogel et al., 1996; Och and Ney, 2000). Given these assumptions, we let the sentence translation probability be mediated by a latent alignment variable (a_1^m in the equations below) that specifies the pairwise mapping between words in the source and target languages. Assuming a given sentence length m for f_1^m, the translation probability is defined as follows:

P(f_1^m | e_1^l) = Σ_{a_1^m} P(f_1^m, a_1^m | e_1^l)
                 = Σ_{a_1^m} P(a_1^m | e_1^l, f_1^m) ∏_{j=1}^m P(f_j | e_{a_j})

Once the model parameters have been estimated, the single-best word alignment is computed according to the following decision rule:

â_1^m = argmax_{a_1^m} P(a_1^m | e_1^l, f_1^m) ∏_{j=1}^m P(f_j | e_{a_j})

In this section, we consider the MapReduce implementation of two specific alignment models:

1. IBM Model 1, where P(a_1^m | e_1^l, f_1^m) is uniform over all possible alignments.

2. The HMM alignment model, where P(a_1^m | e_1^l, f_1^m) = ∏_{j=1}^m P(a_j | a_{j−1}).
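Under IBM Model 1 the decision rule above factors word by word, so the single-best alignment can be read off directly from the lexical table. The following is a minimal Python sketch under that assumption (illustrative names; it omits the NULL source word that is commonly added in practice, and the toy probabilities are made up):

```python
def viterbi_alignment_model1(english_words, foreign_words, t):
    """For each foreign word f_j, pick the English word e_i maximizing t[(f_j, e_i)].

    t maps (foreign_word, english_word) -> P(f | e); unseen pairs get probability 0.
    """
    alignment = []
    for j, f in enumerate(foreign_words):
        best_i = max(range(len(english_words)),
                     key=lambda i: t.get((f, english_words[i]), 0.0))
        alignment.append((j, best_i))
    return alignment


# Toy lexical table in the spirit of Figure 7 (values chosen purely for illustration).
t = {("la", "the"): 0.9, ("maison", "house"): 0.8, ("maison", "the"): 0.1,
     ("bleue", "blue"): 0.9}
print(viterbi_alignment_model1(["the", "blue", "house"], ["la", "maison", "bleue"], t))
# [(0, 0), (1, 2), (2, 1)]
```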

Page 74: SMT Model Building with MapReduce

The Phrase-Based SMT Pipeline

[Pipeline diagram: parallel text → 1. alignment modeling → word alignment → 2. phrase extraction and scoring → phrase table → decoder (with a language model); example input: η συσκευή μου δεν λειτουργεί ... ; example output: my machine is not working ... ; timings shown: 26h17m, 48h06m, 1.2s / sent.]

Page 75: SMT Model Building with MapReduce

The Phrase-Based SMT Pipeline

[Same pipeline diagram as on the previous slide, now also showing a timing of 1h58m.]


Page 77: SMT Model Building with MapReduce

Word alignment

will be the marginal and one that contributes to the pair count. The reducer groups all pairs together by the A value, processes the marginal first, and, like Method 1, must only keep this value in memory as it processes the remaining pair counts. Method 2 requires more data to be processed by the MapReduce framework, but only requires a single sort operation (i.e., fewer MapReduce iterations).
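To make this single-sort trick concrete, here is a minimal sketch of Method 2 in Python (the authors' implementation was Java on Hadoop; the function names and the in-memory simulation of the shuffle are illustrative assumptions). The special 〈A, ∗〉 key is made to sort before every 〈A, B〉 key, so the reducer sees the marginal first and normalizes the remaining counts in one streaming pass, holding only the marginal in memory.

```python
from collections import defaultdict

STAR = ""  # sorts before any real B value, so <A,*> arrives first


def map_method2(pairs):
    """For each observed pair (A, B), emit a count for the pair and a count for the marginal."""
    for a, b in pairs:
        yield (a, str(b)), 1   # contributes to c(A, B)
        yield (a, STAR), 1     # contributes to c(A)


def reduce_method2(sorted_items):
    """Consume keys sorted so that (A, *) precedes every (A, B) for the same A."""
    marginal = None
    for (a, b), counts in sorted_items:
        if b == STAR:
            marginal = sum(counts)                      # c(A), seen first for each A
        else:
            yield (a, b), sum(counts) / marginal        # c(A, B) / c(A)


def run(pairs):
    # Simulate the shuffle: group identical keys, then sort the keys.
    grouped = defaultdict(list)
    for key, value in map_method2(pairs):
        grouped[key].append(value)
    return dict(reduce_method2(sorted(grouped.items())))


corpus = [("a", 1), ("a", 2), ("b", 1), ("a", 2), ("a", 3), ("b", 4),
          ("a", 2), ("b", 1), ("a", 2), ("b", 1), ("a", 1)]
print(run(corpus))  # {('a', '1'): 0.2857..., ('a', '2'): 0.5714..., ...}
```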

Method 3 works slightly differently: rather than computing the pair counts independently of each other, the counts of all the B events jointly occurring with a particular A = a event are stored in an associative data structure in memory in the reducer. The marginal c(A) can be computed by summing over all the values in the associative data structure, and then a second pass normalizes. This requires that the conditional distribution P(B|A = a) not have so many parameters that it cannot be represented in memory. A potential advantage of this approach is that the MapReduce framework can use a "combiner" to group many 〈A, B〉 pairs into a single value before the key/value pair leaves for the reducer.4 If the underlying distribution from which the pairs 〈A, B〉 are drawn has certain characteristics, this can result in a significant reduction in the number of keys that the mapper emits (although the number of statistics will be identical). And since all keys must be sorted prior to the reducer step beginning, reducing the number of keys can have a significant performance impact.

The graph in Figure 3 shows the performance of the three problem decompositions on the two model types we are estimating: conditional phrase translation probabilities (1.5M sentences, max phrase length = 7), and conditional lexical translation probabilities as found in a word alignment model (500k sentences). In both cases, Method 3, which makes use of more memory to store counts of all B events associated with event A = a, completes at least 50% more quickly. This efficiency is due to the Zipfian distribution of both phrases and lexical items in our corpora: a few frequent items account for a large portion of the corpus. The memory requirements were also observed to be quite reasonable for the

4 Combiners operate like reducers, except they run directly on the output of a mapper before the results leave memory. They can be used when the reduction operation is associative and commutative. For more information refer to Dean and Ghemawat (2004).

[Bar chart: time in seconds (0 to 1,600) by estimation method (Method 1, Method 2, Method 3), shown for phrase pairs and word pairs.]

Figure 3: P_MLE computation strategies.

Figure 4: A word-aligned sentence. Examples of consistent phrase pairs include 〈vi, i saw〉, 〈la mesa pequeña, the small table〉, and 〈mesa pequeña, small table〉; but note that, for example, it is not possible to extract a consistent phrase corresponding to the foreign string la mesa or the English string the small.

models in question: representing P(B|A = a) in the phrase model required at most 90k parameters, and in the lexical model, 128k parameters (i.e., the size of the vocabulary for language B). For the remainder of the experiments reported, we confine ourselves to the use of Method 3.

4 Phrase-Based Translation

In phrase-based translation, the translation process is modeled by splitting the source sentence into phrases (contiguous strings of words) and translating the phrases as a unit (Och et al., 1999; Koehn et al., 2003). Phrases are extracted from a word-aligned parallel sentence according to the strategy proposed by Och et al. (1999), where every word in a phrase is aligned only to other words in the phrase, and not to any words outside the phrase bounds. Figure 4 shows an example aligned sentence and some of the consistent subphrases that may be extracted.
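The consistency condition lends itself to a short check: a candidate source/target span is a valid phrase pair if every alignment point touching either side falls inside the pair. The Python sketch below is illustrative only (it ignores unaligned-word extensions and the maximum phrase length used in the experiments, and the toy alignment points are chosen for the example).

```python
def extract_consistent_phrases(src_words, tgt_words, alignment):
    """alignment is a set of (i, j) points meaning src_words[i] ~ tgt_words[j].

    Returns all phrase pairs whose alignment points stay inside the pair
    (the consistency criterion of Och et al., 1999), requiring at least one point.
    """
    pairs = []
    for i1 in range(len(src_words)):
        for i2 in range(i1, len(src_words)):
            # Target positions linked to the source span [i1, i2].
            linked_j = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked_j:
                continue
            j1, j2 = min(linked_j), max(linked_j)
            # Consistent only if nothing in the target span links outside [i1, i2].
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                pairs.append((" ".join(src_words[i1:i2 + 1]),
                              " ".join(tgt_words[j1:j2 + 1])))
    return pairs


# Toy example in the spirit of Figure 4.
src = ["vi", "la", "mesa", "pequeña"]
tgt = ["i", "saw", "the", "small", "table"]
points = {(0, 0), (0, 1), (1, 2), (2, 4), (3, 3)}
for pair in extract_consistent_phrases(src, tgt, points):
    print(pair)   # includes ('vi', 'i saw'), ('la mesa pequeña', 'the small table'), ...
```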

To build our models, we need this:

But, the alignment points aren’t given...

Page 78: SMT Model Building with MapReduce

Word alignment


EM to the rescue!

Page 79: SMT Model Building with MapReduce

Generative alignment models: a brief intro


Page 80: SMT Model Building with MapReduce

Generative alignment models: a brief intro


Assume a lexical model!

Page 81: SMT Model Building with MapReduce

Generative alignment models: a brief intro


Still too complicated, so we make one of two further assumptions:

- P(a_1^m | e_1^l, f_1^m) is uniform over all possible alignments (IBM Model 1)
- P(a_1^m | e_1^l, f_1^m) = ∏_{j=1}^m P(a_j | a_{j−1}) (HMM)

Page 82: SMT Model Building with MapReduce

Generative alignment models: a brief intro


Once we have such a model, computing the Viterbi alignment is simply:

â_1^m = argmax_{a_1^m} P(a_1^m | e_1^l, f_1^m) ∏_{j=1}^m P(f_j | e_{a_j})

Page 83: SMT Model Building with MapReduce

Word alignment

Estimating the parameters for these models is more difficult (and more computationally expensive) than with the models considered in the previous section: rather than simply being able to count the word pairs and alignment relationships and estimate the models directly, we must use an existing model to compute the expected counts for all possible alignments, and then use these counts to update the new model.7

This training strategy is referred to as expectation-maximization (EM) and is guaranteed to improve the quality of the prior model at each iteration (Brown et al., 1993; Dempster et al., 1977).

Although it is necessary to compute a sum over all possible alignments, the independence assumptions made in these models allow the total probability of generating a particular observation to be efficiently computed using dynamic programming.8 The HMM alignment model uses the forward-backward algorithm (Baum et al., 1970), which is also an instance of EM. Even with dynamic programming, this requires O(Slm) operations for Model 1, and O(Slm²) for the HMM model, where m and l are the average lengths of the foreign and English sentences in the training corpus, and S is the number of sentences. Figure 6 shows measurements of the average iteration run-time for Model 1 and the HMM alignment model as implemented in Giza++ (Och and Ney, 2003), a state-of-the-art C++ implementation of the IBM and HMM alignment models that is widely used. Five iterations are generally necessary to train the models, so the time to carry out full training of the models is approximately five times the per-iteration run-time.
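To put these costs in perspective, consider an illustrative corpus (the sizes here are assumptions, not figures from the paper) of S = 10^6 sentence pairs with average lengths l = m = 20: one Model 1 iteration then requires on the order of S·l·m = 4 × 10^8 operations, while one HMM iteration requires on the order of S·l·m² = 8 × 10^9, which is consistent with the HMM's higher per-iteration run-times in Figure 6.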

5.1 EM with MapReduce

Expectation-maximization algorithms can be expressed quite naturally in the MapReduce framework (Chu et al., 2006). In general, for discrete generative models, mappers iterate over the training instances and compute the partial expected counts for all the unobservable events in the model that should

7 For the first iteration, when there is no prior model, a heuristic, random, or uniform distribution may be chosen.

8 For IBM Models 3-5, which are not our primary focus, dynamic programming is not possible, but the general strategy for computing expected counts from a previous model and updating remains identical, and therefore the techniques we suggest in this section are applicable to those models as well.

[Plot: average iteration latency in seconds (log scale, 3 s to 3 hrs) versus corpus size in sentences (10,000 to 1,000,000); series: Model 1, HMM.]

Figure 6: Per-iteration average run-times for Giza++ implementations of Model 1 and HMM training on corpora of various sizes.

be associated with the given training instance. Reducers aggregate these partial counts to compute the total expected joint counts. The updated model is estimated using the maximum likelihood criterion, which just involves computing the appropriate marginal and dividing (as with the phrase-based models), and the same techniques suggested in Section 3 can be used with no modification for this purpose. For word alignment models, Method 3 is possible since word pairs distribute according to Zipf's law (meaning there is ample opportunity for the combiners to combine records), and the number of parameters for P(e|f_j = f) is at most the number of items in the vocabulary of E, which tends to be on the order of hundreds of thousands of words, even for large corpora.

Since the alignment models we are considering are fundamentally based on a lexical translation probability model, i.e., the conditional probability distribution P(e|f), we describe in some detail how EM updates the parameters for this model.9 Using the model parameters from the previous iteration (or starting from an arbitrary or heuristic set of parameters during the first iteration), an expected count is computed for every l × m pair 〈e_i, f_j〉 for each parallel sentence in the training corpus. Figure 7 illustrates

9 Although computation of the expected count for a word pair in a given training instance obviously depends on which model is being used, the set of word pairs for which partial counts are produced for each training instance, as well as the process of aggregating the partial counts and updating the model parameters, is identical across this entire class of models.

A familiar problem with the conventional tools:

Page 84: SMT Model Building with MapReduce

EM for MapReduce

• EM relies on MLE, but counts are fractional rather than whole

• Same MR strategies are available (and same optimizations!)

[Figure content: (a) a grid of cells indexed by the English words (the, blue, house, flower) and the foreign words (la, maison, bleue, fleur); (b) the example training data: la maison / the house, la maison bleue / the blue house, la fleur / the flower.]

Figure 7: Each cell in (a) contains the expected counts for the word pair 〈e_i, f_j〉. In (b) the example training data is marked to show which training instances contribute partial counts for the pair 〈house, maison〉.
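To make the fractional counts concrete, consider the pair 〈house, maison〉 in the toy data of Figure 7(b) on the first EM iteration, assuming a uniform initial model and ignoring the NULL word (both simplifications for this illustration). In "la maison / the house" the word "house" spreads its alignment mass evenly over two foreign words, contributing 1/2, and in "la maison bleue / the blue house" over three, contributing 1/3, so the aggregated expected count is c(house, maison) = 1/2 + 1/3 ≈ 0.83, a fractional rather than whole count that is subsequently normalized along with the other counts for maison.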

[Plot: time (seconds) versus corpus size (sentences); curves: Optimal Model 1 (Giza/38), Optimal HMM (Giza/38), MapReduce Model 1 (38 M/R), MapReduce HMM (38 M/R).]

Figure 8: Average per-iteration latency to train HMM and Model 1 using the MapReduce EM trainer, compared to an optimal parallelization of Giza++ across the same number of processors.

Figure 7 illustrates the relationship between the individual training instances and the global expected counts for a particular word pair. After collecting counts, the conditional probability P(e|f) is computed by summing each column (i.e., the counts for a given f) and dividing. Note that under this training regime, a non-zero probability P(e_i|f_j) will be possible only if e_i and f_j co-occur in at least one training instance.
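A minimal single-machine sketch of this E-step and normalization in Java (the language used for our implementation) is given below. The class and method names, the uniform fallback used when no prior model exists, and the omission of the NULL word are illustrative assumptions rather than details of the actual system; the mapper emits fractional counts keyed by the foreign word (the Method 3 layout), and the reducer sums them and divides by the marginal.

import java.util.*;

// Sketch: map/reduce logic for one EM iteration of the lexical model P(e|f).
public class Model1EStepSketch {

  // prev.get(f).get(e) = P(e|f) from the previous iteration; empty on iteration 1.
  static Map<String, Map<String, Double>> prev = new HashMap<>();

  static double t(String e, String f) {
    Map<String, Double> row = prev.get(f);
    // Uniform value when there is no prior model (first iteration).
    return (row != null && row.containsKey(e)) ? row.get(e) : 1.0;
  }

  // Mapper: for one sentence pair, emit partial expected counts keyed by f.
  static Map<String, Map<String, Double>> map(String[] e, String[] f) {
    Map<String, Map<String, Double>> out = new HashMap<>();
    for (String ei : e) {
      double z = 0.0;                        // normalizer over the foreign words
      for (String fj : f) z += t(ei, fj);
      for (String fj : f) {
        double c = t(ei, fj) / z;            // fractional count for <ei, fj>
        out.computeIfAbsent(fj, k -> new HashMap<>()).merge(ei, c, Double::sum);
      }
    }
    return out;
  }

  // Reducer: for one foreign word f, aggregate counts from all mappers and
  // divide by the marginal c(f) to obtain the updated P(e|f).
  static Map<String, Double> reduce(List<Map<String, Double>> partials) {
    Map<String, Double> counts = new HashMap<>();
    for (Map<String, Double> p : partials)
      p.forEach((e, c) -> counts.merge(e, c, Double::sum));
    double marginal = 0.0;
    for (double c : counts.values()) marginal += c;
    final double z = marginal;
    Map<String, Double> updated = new HashMap<>();
    counts.forEach((e, c) -> updated.put(e, c / z));
    return updated;
  }

  public static void main(String[] args) {
    // Toy usage: first-iteration partial counts for one sentence pair.
    System.out.println(map(new String[]{"the", "house"},
                           new String[]{"la", "maison"}));
    // Each foreign word receives 0.5 from "the" and 0.5 from "house".
  }
}

As in the text, a non-zero probability survives the division only for word pairs that actually co-occur, since all other cells of the count table remain zero.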

5.2 Experimental Results

Figure 8 shows the timing results of the MapReduce implementation of Model 1 and the HMM alignment model. Similar to the phrase extraction experiments, we show as reference the running time of a hypothetical, optimally-parallelized version of Giza++ on our cluster (i.e., the values in Figure 6 divided by 38, the number of concurrent map/reduce workers provided by our 19 dual-processor slave machines). Whereas in the single-core implementation the

added complexity of the HMM model has a significant impact on the per-iteration running time, the data exchange overhead dominates the performance of both models in a MapReduce environment, making their running times virtually indistinguishable. For these experiments, after each EM iteration, the updated model parameters (which are computed in a distributed fashion) are compiled into a compressed representation which is then distributed to all the processors in the cluster at the beginning of the next iteration. The time taken for this process is included in the iteration latencies shown in the graph. In future work, we plan to use a distributed model representation to improve speed and scalability.
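The text does not specify the compressed representation; as a minimal sketch of the idea, assuming a single gzipped text file of "f <tab> e <tab> P(e|f)" records (file layout and class name are hypothetical), the compile and reload steps might look like this:

import java.io.*;
import java.util.*;
import java.util.zip.*;

// Hypothetical sketch: write the updated lexical model to one gzipped file
// after an iteration, then reload it on every worker before the next iteration.
public class ModelCompiler {

  static void writeModel(Map<String, Map<String, Double>> model, File out) throws IOException {
    try (PrintWriter w = new PrintWriter(new OutputStreamWriter(
             new GZIPOutputStream(new FileOutputStream(out)), "UTF-8"))) {
      for (Map.Entry<String, Map<String, Double>> row : model.entrySet())
        for (Map.Entry<String, Double> cell : row.getValue().entrySet())
          w.println(row.getKey() + "\t" + cell.getKey() + "\t" + cell.getValue());
    }
  }

  static Map<String, Map<String, Double>> readModel(File in) throws IOException {
    Map<String, Map<String, Double>> model = new HashMap<>();
    try (BufferedReader r = new BufferedReader(new InputStreamReader(
             new GZIPInputStream(new FileInputStream(in)), "UTF-8"))) {
      String line;
      while ((line = r.readLine()) != null) {
        String[] cols = line.split("\t");    // columns: f, e, P(e|f)
        model.computeIfAbsent(cols[0], k -> new HashMap<>())
             .put(cols[1], Double.parseDouble(cols[2]));
      }
    }
    return model;
  }
}

In a Hadoop setting, the compiled file would be pushed to the workers through the distributed file system (or a facility such as the distributed cache) before the map tasks of the next iteration start.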

6 Related work

Expectation-maximization algorithms have been previously deployed in the MapReduce framework in the context of several different applications (Chu et al., 2006; Das et al., 2007; Wolfe et al., 2007). Wolfe et al. (2007) specifically looked at the performance of Model 1 on MapReduce and discuss how several different strategies can minimize the amount of communication required, but they ultimately advocate abandoning the MapReduce model. While their techniques do lead to modest performance improvements, we question the cost-effectiveness of the approach in general, since it sacrifices many of the advantages provided by the MapReduce environment. In our future work, we instead intend to make use of an approach suggested by Das et al. (2007), who show that a distributed database running in tandem with MapReduce can be used to provide the parameters for very large mixture models efficiently. Moreover, since the database is distributed across the same nodes as the MapReduce jobs, many of the same data locality benefits that Wolfe et al. (2007) sought to capitalize on will be available without abandoning the guarantees of the MapReduce paradigm.

Although it does not use MapReduce, the MTTK tool suite implements distributed Model 1, 2, and HMM training using a "home-grown" parallelization scheme (Deng and Byrne, 2006). However, the tool relies on a cluster where all nodes have access to the same shared networked file storage, a restriction that MapReduce does not impose.


Page 85: SMT Model Building with MapReduce

MapReduce word alignment

[Plot: time (seconds) versus corpus size (sentences); curve: HMM alignment (Giza toolkit).]

Page 86: SMT Model Building with MapReduce

MapReduce word alignment

[Plot: time (seconds) versus corpus size (sentences); curves: HMM alignment (Giza toolkit), HMM alignment (MapReduce implementation).]

Page 87: SMT Model Building with MapReduce

MapReduce word alignment

[Plot: time (seconds) versus corpus size (sentences); curves: HMM alignment (hypothetical, optimally parallelized), HMM alignment (MapReduce implementation).]

Page 88: SMT Model Building with MapReduce

The Phrase-Based SMT Pipeline

[Pipeline diagram: 1. alignment modeling (parallel text → word alignment); 2. phrase extraction and scoring (word alignment → phrase table); the decoder then uses the phrase table and a language model to translate input such as the Greek "η συσκευή μου δεν λειτουργεί ..." into "my machine is not working ...". Timings annotated on the slide: 26h17m, 48h06m, 1.2s / sent, 1h58m.]

Page 89: SMT Model Building with MapReduce

The Phrase-Based SMT Pipeline

[Same pipeline diagram as the previous slide, with the annotated timings 26h17m, 48h06m, 1.2s / sent, 1h58m, and 0h57m.]

Page 90: SMT Model Building with MapReduce

Future Work

• Word alignment

• How to access/distribute the prior model?

• Is EM really a good choice?

• Good results in a Bayesian framework

• Ongoing work using a CRF-based model

• Are exact solutions really necessary?

• How can we improve data locality?

Page 91: SMT Model Building with MapReduce

Thank You!

*This research was supported by the GALE program of the Defense Advanced Research Projects Agency, Contract No. HR0011-06-2-0001

Jimmy Lin

Eugene Hung

Philip Resnik

Miles Osborne

Chris Manning

CBCB@UMD

IBM

Google