Improving Statistical Machine Translation Using Word Sense...

Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and ComputationalNatural Language Learning, pp. 61–72, Prague, June 2007. c©2007 Association for Computational Linguistics

Improving Statistical Machine Translation usingWord Sense Disambiguation

Marine CARPUAT Dekai WU∗

[email protected] [email protected]

Human Language Technology CenterHKUST

Department of Computer Science and EngineeringUniversity of Science and Technology, Clear Water Bay, Hong Kong

Abstract

We show for the first time that incorporatingthe predictions of a word sense disambigua-tion system within a typical phrase-basedstatistical machine translation (SMT) modelconsistently improves translation qualityacross all three different IWSLT Chinese-English test sets, as well as producing sta-tistically significant improvements on thelarger NIST Chinese-English MT task—and moreover never hurts performance onany test set, according not only to BLEUbut to all eight most commonly used au-tomatic evaluation metrics. Recent workhas challenged the assumption that wordsense disambiguation (WSD) systems areuseful for SMT. Yet SMT translation qual-ity still obviously suffers from inaccuratelexical choice. In this paper, we addressthis problem by investigating a new strat-egy for integrating WSD into an SMT sys-tem, that performs fully phrasal multi-worddisambiguation. Instead of directly incor-porating a Senseval-style WSD system, weredefine the WSD task to match the ex-act same phrasal translation disambiguationtask faced by phrase-based SMT systems.Our results provide the first known empir-ical evidence that lexical semantics are in-deed useful for SMT, despite claims to thecontrary.

∗This material is based upon work supported in part bythe Defense Advanced Research Projects Agency (DARPA)under GALE Contract No. HR0011-06-C-0023, and by theHong Kong Research Grants Council (RGC) research grants

1 Introduction

Common assumptions about the role and useful-ness of word sense disambiguation (WSD) modelsin full-scale statistical machine translation (SMT)systems have recently been challenged.

On the one hand, in previous work (Carpuat andWu, 2005b) we obtained disappointing results whenusing the predictions of a Senseval WSD system inconjunction with a standard word-based SMT sys-tem: we reported slightly lower BLEU scores de-spite trying to incorporate WSD using a numberof apparently sensible methods. These results castdoubt on the assumption that sophisticated dedicatedWSD systems that were developed independentlyfrom any particular NLP application can easily beintegrated into a SMT system so as to improve trans-lation quality through stronger models of contextand rich linguistic information. Rather, it has beenargued, SMT systems have managed to achieve sig-nificant improvements in translation quality withoutdirectly addressing translation disambiguation as aWSD task. Instead, translation disambiguation deci-sions are made indirectly, typically using only wordsurface forms and very local contextual information,forgoing the much richer linguistic information thatWSD systems typically take advantage of.

On the other hand, error analysis reveals that theperformance of SMT systems still suffers from inac-curate lexical choice. In subsequent empirical stud-ies, we have shown that SMT systems perform muchworse than dedicated WSD models, both supervised

RGC6083/99E, RGC6256/00E, and DAG03/04.EG09. Anyopinions, findings and conclusions or recommendations ex-pressed in this material are those of the author(s) and do notnecessarily reflect the views of the Defense Advanced ResearchProjects Agency.

61

and unsupervised, on a Senseval WSD task (Carpuatand Wu, 2005a), and therefore suggest that WSDshould have a role to play in state-of-the-art SMTsystems. In addition to the Senseval shared tasks,which have provided standard sense inventories anddata sets, WSD research has also turned increasinglyto designing specific models for a particular applica-tion. For instance, Vickrey et al. (2005) and Specia(2006) proposed WSD systems designed for Frenchto English, and Portuguese to English translation re-spectively, and present a more optimistic outlook forthe use of WSD in MT, although these WSD sys-tems have not yet been integrated nor evaluated infull-scale machine translation systems.

Taken together, these seemingly contradictory re-sults suggest that improving SMT lexical choice ac-curacy remains a key challenge to improve currentSMT quality, and that it is still unclear what isthe most appropriate integration framework for theWSD models in SMT.

In this paper, we present first results with anew architecture that integrates a state-of-the-artWSD model into phrase-based SMT so as to per-form multi-word phrasal lexical disambiguation,and show that this new WSD approach not onlyproduces gains across all available Chinese-EnglishIWSLT06 test sets for all eight commonly used au-tomated MT evaluation metrics, but also producesstatistically significant gains on the much largerNIST Chinese-English task. The main differencebetween this approach and several of our earlier ap-proaches as described in Carpuat and Wu (2005b)and subsequently Carpuat et al. (2006) lies in thefact that we focus on repurposing the WSD systemfor multi-word phrase-based SMT. Rather than us-ing a generic Senseval WSD model as we did inCarpuat and Wu (2005b), here both the WSD train-ing and the WSD predictions are integrated into thephrase-based SMT framework. Furthermore, ratherthan using a single word based WSD approach toaugment a phrase-based SMT model as we did inCarpuat et al. (2006) to improve BLEU and NISTscores, here the WSD training and predictions oper-ate on full multi-word phrasal units, resulting in sig-nificantly more reliable and consistent gains as eva-luted by many other translation accuracy metrics aswell. Specifically:

• Instead of using a Senseval system, we redefinethe WSD task to be exactly the same as lexi-cal choice task faced by the multi-word phrasaltranslation disambiguation task faced by thephrase-based SMT system.

• Instead of using predefined senses drawn frommanually constructed sense inventories such asHowNet (Dong, 1998), our WSD for SMT sys-tem directly disambiguates between all phrasaltranslation candidates seen during SMT train-ing.

• Instead of learning from manually annotatedtraining data, our WSD system is trained on thesame corpora as the SMT system.

However, despite these adaptations to the SMTtask, the core sense disambiguation task remainspure WSD:

• The rich context features are typical of WSDand almost never used in SMT.

• The dynamic integration of context-sensitivetranslation probabilities is not typical of SMT.

• Although it is embedded in a real SMT sys-tem, the WSD task is exactly the same as inrecent and coming Senseval Multilingual Lexi-cal Sample tasks (e.g., Chklovski et al. (2004)),where sense inventories represent the semanticdistinctions made by another language.

We begin by presenting the WSD module andthe SMT integration technique. We then show thatincorporating it into a standard phrase-based SMTbaseline system consistently improves translationquality across all three different test sets from theChinese-English IWSLT text translation evaluation,as well as on the larger NIST Chinese-English trans-lation task. Depending on the metric, the individualgains are sometimes modest, but remarkably, incor-porating WSD never hurts, and helps enough to al-ways make it a worthwile additional component inan SMT system. Finally, we analyze the reasons forthe improvement.

62

2 Problems in context-sensitive lexicalchoice for SMT

To the best of our knowledge, there has been no pre-vious attempt at integrating a state-of-the-art WSDsystem for fully phrasal multi-word lexical choiceinto phrase-based SMT, with evaluation of the re-sulting system on a translation task. While thereare many evaluations of WSD quality, in particularthe Senseval series of shared tasks (Kilgarriff andRosenzweig (1999), Kilgarriff (2001), Mihalcea etal. (2004)), very little work has been done to addressthe actual integration of WSD in realistic SMT ap-plications.

To fully integrate WSD into phrase-based SMT,it is necessary to perform lexical disambiguationon multi-word phrasal lexical units; in contrast,the model reported in Cabezas and Resnik (2005)can only perform lexical disambiguation on sin-gle words. Like the model proposed in this paper,Cabezas and Resnik attempted to integrate phrase-based WSD models into decoding. However, al-though they reported that incorporating these predic-tions via the Pharaoh XML markup scheme yieldeda small improvement in BLEU score over a Pharaohbaseline on a single Spanish-English translation dataset, we have determined empirically that applyingtheir single-word based model to several Chinese-English datasets does not yield systematic improve-ments on most MT evaluation metrics (Carpuat andWu, 2007). The single-word model has the disad-vantage of forcing the decoder to choose betweenthe baseline phrasal translation probabilities versusthe WSD model predictions for single words. In ad-dition, the single-word model does not generalizeto WSD for phrasal lexical choice, as overlappingspans cannot be specified with the XML markupscheme. Providing WSD predictions for phraseswould require committing to a phrase segmenta-tion of the input sentence before decoding, whichis likely to hurt translation quality.

It is also necessary to focus directly on translationaccuracy rather than other measures such as align-ment error rate, which may not actually lead to im-proved translation quality; in contrast, for example,Garcia-Varea et al. (2001) and Garcia-Varea et al.(2002) show improved alignment error rate with amaximum entropy based context-dependent lexical

choice model, but not improved translation accu-racy. In contrast, our evaluation in this paper is con-ducted on the actual decoding task, rather than in-termediate tasks such as word alignment. Moreover,in the present work, all commonly available auto-mated MT evaluation metrics are used, rather thanonly BLEU score, so as to maintain a more balancedperspective.

Another problem in the context-sensitive lexicalchoice in SMT models of Garcia Varea et al. is thattheir feature set is insufficiently rich to make muchbetter predictions than the SMT model itself. Incontrast, our WSD-based lexical choice models aredesigned to directly model the lexical choice in theactual translation direction, and take full advantageof not residing strictly within the Bayesian source-channel model in order to benefit from the muchricher Senseval-style feature set this facilitates.

Garcia Varea et al. found that the best results areobtained when the training of the context-dependenttranslation model is fully incorporated with the EMtraining of the SMT system. As described below,the training of our new WSD model, though not in-corporated within the EM training, is also far moreclosely tied to the SMT model than is the case withtraditional standalone WSD models.

In contrast with Brown et al. (1991), our ap-proach incorporates the predictions of state-of-the-art WSD models that use rich contextual features forany phrase in the input vocabulary. In Brown et al.’searly study of WSD impact on SMT performance,the authors reported improved translation quality ona French to English task, by choosing an Englishtranslation for a French word based on the singlecontextual feature which is reliably discriminative.However, this was a pilot study, which is limited towords with exactly two translation candidates, and itis not clear that the conclusions would generalize tomore recent SMT architectures.

3 Problems in translation-oriented WSD

The close relationship between WSD and SMT hasbeen emphasized since the emergence of WSD asan independent task. However, most of previous re-search has focused on using multilingual resourcestypically used in SMT systems to improve WSD ac-curacy, e.g., Dagan and Itai (1994), Li and Li (2002),

63

Diab (2004). In contrast, this paper focuses on theconverse goal of using WSD models to improve ac-tual translation quality.

Recently, several researchers have focused on de-signing WSD systems for the specific purpose oftranslation. Vickrey et al. (2005) train a logistic re-gression WSD model on data extracted from auto-matically word aligned parallel corpora, but evaluateon a blank filling task, which is essentially an eval-uation of WSD accuracy. Specia (2006) describesan inductive logic programming-based WSD sys-tem, which was specifically designed for the purposeof Portuguese to English translation, but this systemwas also only evaluated on WSD accuracy, and notintegrated in a full-scale machine translation system.

Ng et al. (2003) show that it is possible to useautomatically word aligned parallel corpora to trainaccurate supervised WSD models. The purpose ofthe study was to lower the annotation cost for su-pervised WSD, as suggested earlier by Resnik andYarowsky (1999). However this result is also en-couraging for the integration of WSD in SMT, sinceit suggests that accurate WSD can be achieved usingtraining data of the kind needed for SMT.

4 Building WSD models for phrase-basedSMT

4.1 WSD models for every phrase in the inputvocabulary

Just like for the baseline phrase translation model,WSD models are defined for every phrase in the in-put vocabulary. Lexical choice in SMT is naturallyframed as a WSD problem, so the first step of inte-gration consists of defining a WSD model for everyphrase in the SMT input vocabulary.

This differs from traditional WSD tasks, wherethe WSD target is a single content word. Sense-val for instance has either lexical sample or all wordtasks. The target words for both categories of Sen-seval WSD tasks are typically only content words—primarily nouns, verbs, and adjectives—while in thecontext of SMT, we need to translate entire sen-tences, and therefore have a WSD model not onlyfor every word in the input sentences, regardless oftheir POS tag, but for every phrase, including tokenssuch as articles, prepositions and even punctuation.Further empirical studies have suggested that includ-

ing WSD predictions for those longer phrases is akey factor to help the decoder produce better trans-lations (Carpuat and Wu, 2007).

4.2 WSD uses the same sense definitions as theSMT system

Instead of using pre-defined sense inventories, theWSD models disambiguate between the SMT trans-lation candidates. In order to closely integrate WSDpredictions into the SMT system, we need to formu-late WSD models so that they produce features thatcan directly be used in translation decisions takenby the SMT system. It is therefore necessary for theWSD and SMT systems to consider exactly the sametranslation candidates for a given word in the inputlanguage.

Assuming a standard phrase-based SMT system(e.g., Koehn et al. (2003)), WSD senses are thus ei-ther words or phrases, as learned in the SMT phrasaltranslation lexicon. Those “sense” candidates arevery different from those typically used even in ded-icated WSD tasks, even in the multilingual Sensevaltasks. Each candidate is a phrase that is not neces-sarily a syntactic noun or verb phrase as in manuallycompiled dictionaries. It is quite possible that dis-tinct “senses” in our WSD for SMT system could beconsidered synonyms in a traditional WSD frame-work, especially in monolingual WSD.

In addition to the consistency requirements for in-tegration, this requirement is also motivated by em-pirical studies, which show that predefined trans-lations derived from sense distinctions defined inmonolingual ontologies do not match translationdistinction made by human translators (Specia et al.,2006).

4.3 WSD uses the same training data as theSMT system

WSD training does not require any other resourcesthan SMT training, nor any manual sense annota-tion. We employ supervised WSD systems, sinceSenseval results have amply demonstrated that su-pervised models significantly outperform unsuper-vised approaches (see for instance the English lexi-cal sample tasks results described by Mihalcea et al.(2004)).

Training examples are annotated using the phrasealignments learned during SMT training. Every in-

64

put language phrase is sense-tagged with its alignedoutput language phrase in the parallel corpus. Thephrase alignment method used to extract the WSDtraining data therefore depends on the one used bythe SMT system. This presents the advantage oftraining WSD and SMT models on exactly the samedata, thus eliminating domain mismatches betweenSenseval data and parallel corpora. But most impor-tantly, this allows WSD training data to be gener-ated entirely automatically, since the parallel corpusis automatically phrase-aligned in order to learn theSMT phrase bilexicon.

4.4 The WSD systemThe word sense disambiguation subsystem is mod-eled after the best performing WSD system in theChinese lexical sample task at Senseval-3 (Carpuatet al., 2004).

The features employed are typical of WSD andare therefore far richer than those used in mostSMT systems. The feature set consists of position-sensitive, syntactic, and local collocational fea-tures, since these features yielded the best resultswhen combined in a naıve Bayes model on severalSenseval-2 lexical sample tasks (Yarowsky and Flo-rian, 2002). These features scale easily to the biggervocabulary and sense candidates to be considered ina SMT task.

The Senseval system consists of an ensemble offour combined WSD models:

The first model is a naıve Bayes model, sinceYarowsky and Florian (2002) found this model to bethe most accurate classifier in a comparative studyon a subset of Senseval-2 English lexical sampledata.

The second model is a maximum entropy model(Jaynes, 1978), since Klein and Manning (Kleinand Manning, 2002) found that this model yieldedhigher accuracy than naıve Bayes in a subsequentcomparison of WSD performance.

The third model is a boosting model (Freundand Schapire, 1997), since boosting has consistentlyturned in very competitive scores on related taskssuch as named entity classification. We also use theAdaboost.MH algorithm.

The fourth model is a Kernel PCA-based model(Wu et al., 2004). Kernel Principal ComponentAnalysis or KPCA is a nonlinear kernel method for

extracting nonlinear principal components from vec-tor sets where, conceptually, the n-dimensional in-put vectors are nonlinearly mapped from their origi-nal space Rn to a high-dimensional feature space Fwhere linear PCA is performed, yielding a transformby which the input vectors can be mapped nonlin-early to a new set of vectors (Scholkopf et al., 1998).WSD can be performed by a Nearest Neighbor Clas-sifier in the high-dimensional KPCA feature space.

All these classifiers have the ability to handlelarge numbers of sparse features, many of whichmay be irrelevant. Moreover, the maximum entropyand boosting models are known to be well suited tohandling features that are highly interdependent.

4.5 Integrating WSD predictions inphrase-based SMT architectures

It is non-trivial to incorporate WSD into an existingphrase-based architecture such as Pharaoh (Koehn,2004), since the decoder is not set up to easily ac-cept multiple translation probabilities that are dy-namically computed in context-sensitive fashion.

For every phrase in a given SMT input sentence,the WSD probabilities can be used as additional fea-ture in a loglinear translation model, in combina-tion with typical context-independent SMT bilexi-con probabilities.

We overcome this obstacle by devising a callingarchitecture that reinitializes the decoder with dy-namically generated lexicons on a per-sentence ba-sis.

Unlike a n-best reranking approach, which is lim-ited by the lexical choices made by the decoder us-ing only the baseline context-independent transla-tion probabilities, our method allows the system tomake full use of WSD information for all competingphrases at all decoding stages.

5 Experimental setup

The evaluation is conducted on two standard Chi-nese to English translation tasks. We follow stan-dard machine translation evaluation procedure us-ing automatic evaluation metrics. Since our goal isto evaluate translation quality, we use standard MTevaluation methodology and do not evaluate the ac-curacy of the WSD model independently.

65

Table 1: Evaluation results on the IWSLT06 dataset: integrating the WSD translation predictions improvesBLEU, NIST, METEOR, WER, PER, CDER and TER across all 3 different available test sets.

TestSet

Exper. BLEU NIST METEOR METEOR(no syn)

TER WER PER CDER

Test 1 SMT 42.21 7.888 65.40 63.24 40.45 45.58 37.80 40.09SMT+WSD 42.38 7.902 65.73 63.64 39.98 45.30 37.60 39.91

Test 2 SMT 41.49 8.167 66.25 63.85 40.95 46.42 37.52 40.35SMT+WSD 41.97 8.244 66.35 63.86 40.63 46.14 37.25 40.10

Test 3 SMT 49.91 9.016 73.36 70.70 35.60 40.60 32.30 35.46SMT+WSD 51.05 9.142 74.13 71.44 34.68 39.75 31.71 34.58

Table 2: Evaluation results on the NIST test set: integrating the WSD translation predictions improvesBLEU, NIST, METEOR, WER, PER, CDER and TER

Exper. BLEU NIST METEOR METEOR(no syn)

TER WER PER CDER

SMT 20.41 7.155 60.21 56.15 76.76 88.26 61.71 70.32SMT+WSD 20.92 7.468 60.30 56.79 71.34 83.87 57.29 67.38

5.1 Data setPreliminary experiments are conducted using train-ing and evaluation data drawn from the multilin-gual BTEC corpus, which contains sentences used inconversations in the travel domain, and their transla-tions in several languages. A subset of this data wasmade available for the IWSLT06 evaluation cam-paign (Paul, 2006); the training set consists of 40000sentence pairs, and each test set contains around 500sentences. We used only the pure text data, and notthe speech transcriptions, so that speech-specific is-sues would not interfere with our primary goal of un-derstanding the effect of integrating WSD in a full-scale phrase-based model.

A larger scale evaluation is conducted on the stan-dard NIST Chinese-English test set (MT-04), whichcontains 1788 sentences drawn from newswire cor-pora, and therefore of a much wider domain than theIWSLT data set. The training set consists of about 1million sentence pairs in the news domain.

Basic preprocessing was applied to the corpus.The English side was simply tokenized and case-normalized. The Chinese side was word segmentedusing the LDC segmenter.

5.2 Baseline SMT systemSince our focus is not on a specific SMT architec-ture, we use the off-the-shelf phrase-based decoder

Pharaoh (Koehn, 2004) trained on the IWSLT train-ing set. Pharaoh implements a beam search decoderfor phrase-based statistical models, and presentsthe advantages of being freely available and widelyused.

The phrase bilexicon is derived from the inter-section of bidirectional IBM Model 4 alignments,obtained with GIZA++ (Och and Ney, 2003), aug-mented to improve recall using the grow-diag-finalheuristic. The language model is trained on the Eng-lish side of the corpus using the SRI language mod-eling toolkit (Stolcke, 2002).

The loglinear model weights are learned usingChiang’s implementation of the maximum BLEUtraining algorithm (Och, 2003), both for the base-line, and the WSD-augmented system. Due totime constraints, this optimization was only con-ducted on the IWSLT task. The weights used in theWSD-augmented NIST model are based on the bestIWSLT model. Given that the two tasks are quitedifferent, we expect further improvements on theWSD-augmented system after running maximumBLEU optimization for the NIST task.

6 Results and discussion

Using WSD predictions in SMT yields better trans-lation quality on all test sets, as measured by alleight commonly used automatic evaluation metrics.

66

Table 3: Translation examples with and without WSD for SMT, drawn from IWSLT data sets.Input ÷lX-. ¿�Ref. Please transfer to the Chuo train line.SMT Please turn to the Central Line.SMT+WSD Please transfer to Central Line.Input fh(fp��Ref. Do I pay on the bus?SMT Please get on the bus?SMT+WSD I buy a ticket on the bus?Input ��¢��Ref. Do I need a reservation?SMT I need a reservation?SMT+WSD Do I need a reservation?Input �ó�n¤��Ù h��¢�Ref. I want to reconfirm this ticket.SMT I would like to reconfirm a flight for this ticket.SMT+WSD I would like to reconfirm my reservation for this ticket.Input eLïå0£Ì��Ref. Can I get there on foot?SMT Is there on foot?SMT+WSD Can I get there on foot?Input �æ��*¦��@å÷ë¹�Ref. I have another appointment, so please hurry.SMT I have an appointment for a, so please hurry.SMT+WSD I have another appointment, so please hurry.Input ù w�`ýJÉ�0~�G�ï��Ref. Excuse me. Could you tell me the way to Broadway?SMT Could you tell me the way to Broadway? I am sorry.SMT+WSD Excuse me, could you tell me the way to Broadway?Input ù w��ó��*&7�Ref. Excuse me, I want to open an account.SMT Excuse me, I would like to have an account.SMT+WSD Excuse me, I would like to open an account.

The results are shown in Table 1 for IWSLT and Ta-ble 2 for the NIST task. Paired bootstrap resamplingshows that the improvements on the NIST test setare statistically significant at the 95% level.

Remarkably, integrating WSD predictions helpsall the very different metrics. In addition to thewidely used BLEU (Papineni et al., 2002) and NIST(Doddington, 2002) scores, we also evaluate trans-lation quality with the recently proposed Meteor(Banerjee and Lavie, 2005) and four edit-distancestyle metrics, Word Error Rate (WER), Position-independent word Error Rate (PER) (Tillmann et

al., 1997), CDER, which allows block reordering(Leusch et al., 2006), and Translation Edit Rate(TER) (Snover et al., 2006). Note that we reportMeteor scores computed both with and without us-ing WordNet synonyms to match translation candi-dates and references, showing that the improvementis not due to context-independent synonym matchesat evaluation time.

Comparison of the 1-Best decoder output withand without the WSD feature shows that the sen-tences differ by one or more token respectively for25.49%, 30.40% and 29.25% of IWSLT test sets 1,

67

Table 4: Translation examples with and without WSD for SMT, drawn from the NIST test set.Input ¡ûU®X�hÍùÖ�SMT Without any congressmen voted against him.SMT+WSD No congressmen voted against him.Input Ä(fã�L�?VåÊùìTS»ý��¦ô/ä�ýÅç�SMT Russia’s policy in Chechnya and CIS neighbors attitude is even more worried that the

United States.SMT+WSD Russia’s policy in Chechnya and its attitude toward its CIS neighbors cause the United

States still more anxiety.Input ó��ý�ºC¶µb�SMT As for the U.S. human rights conditions?SMT+WSD As for the human rights situation in the U.S.?Input �ÂÜ/:�HBå,��s�Ac�SMT The purpose of my visit to Japan is pray for peace and prosperity.SMT+WSD The purpose of my visit is to pray for peace and prosperity for Japan.Input : 2�P�;¨��Iöf¹ÇÖ�M@*�%ÆÝ�ª½�SMT In order to prevent terrorist activities Los Angeles, the police have taken unprecedented

tight security measures.SMT+WSD In order to prevent terrorist activities Los Angeles, the police to an unprecedented tight

security measures.

2 and 3, and 95.74% of the NIST test set.Tables 3 and 4 show examples of translations

drawn from the IWSLT and NIST test sets respec-tively.

A more detailed analysis reveals WSD predic-tions give better rankings and are more discrimi-native than baseline translation probabilities, whichhelps the final translation in three different ways.

• The rich context features help rank the correcttranslation first with WSD while it is rankedlower according to baseline translation proba-bility scores .

• Even when WSD and baseline translation prob-abilities agree on the top translation candidate,the stronger WSD scores help override wronglanguage model predictions.

• The strong WSD scores for phrases help thedecoder pick longer phrase translations, whileusing baseline translation probabilities oftentranslate those phrases in smaller chunks thatinclude a frequent (and incorrect) translationcandidate.

For instance, the top 4 Chinese sentences in Ta-

ble 4, are better translated by the WSD-augmentedsystem because the WSD scores help the decoderto choose longer phrases. In the first example,the phrase “¡ ûU” is correctly translated asa whole as “No” by the WSD-augmented system,while the baseline translates each word separatelyyielding an incorrect translation. In the followingthree examples, the WSD system encourages the de-coder to translate the long phrases “ô / ä �ýÅç”, “�ý � ºC ¶µ”, and “HB å, ��s�Ac” as single units, while the baseline in-troduces errors by breaking them down into shorterphrases.

The last sentence in the table shows an examplewhere the WSD predictions do not help the base-line system. The translation quality is actually muchworse, since the verb “ÇÖ” is incorrectly trans-lated as “to”, despite the fact that the top candidatepredicted by the WSD system alone is the much bet-ter translation “has taken”, but with a relatively lowprobability of 0.509.

7 Conclusion

We have shown for the first time that integratingmulti-word phrasal WSD models into phrase-based

68

SMT consistently helps on all commonly availableautomated translation quality evaluation metrics onall three different test sets from the Chinese-EnglishIWSLT06 text translation task, and yields statisti-cally significant gains on the larger NIST Chinese-English task. It is important to note that the WSDmodels never hurt translation quality, and alwaysyield individual gains of a level that makes their in-tegration always worthwile.

We have proposed to consistently integrate WSDmodels both during training, where sense definitionsand sense-annotated data are automatically extractedfrom the word-aligned parallel corpora from SMTtraining, and during testing, where the phrasal WSDprobabilities are used by the SMT system just likeall the other lexical choice features.

Context features are derived from state-of-the-artWSD models, and the evaluation is conducted on theactual translation task, rather than intermediate taskssuch as word alignment.

It is to be emphasized that this approach does notmerely consist of adding a source sentence featurein the log linear model for translation. On the con-trary, it remains a real WSD task, defined just asin the Senseval Multilingual Lexical Sample tasks(e.g., Chklovski et al. (2004)). Our model makes useof typical WSD features that are almost never usedin SMT systems, and requires a dynamically createdtranslation lexicon on a per-sentence basis.

To our knowledge this constitues the first attemptat fully integrating state-of-the-art WSD with con-ventional phrase-based SMT. Unlike previous ap-proaches, the WSD targets are not only single words,but multi-word phrases, just as in the SMT sys-tem. This means that WSD senses are unusuallypredicted not only for a limited set of single wordsor very short phrases, but for all phrases of arbitrar-ily length that are in the SMT translation lexicon.The single word approach, as we reported in Carpuatet al. (2006), improved BLEU and NIST scoresfor phrase-based SMT, but subsequent detailed em-pirical studies we have performed since then sug-gest that single word WSD approaches are less suc-cessful when evaluated under all other MT metrics(Carpuat and Wu, 2007). Thus, fully phrasal WSDpredictions for longer phrases, as reported in this pa-per, are particularly important to improve translationquality.

The results reported in this paper cast new light onthe WSD vs. SMT debate, suggesting that a closeintegration of WSD and SMT decisions should beincorporated in a SMT model that successfully usesWSD predictions. Our objective here is to demon-strate that this technique works for the widest pos-sible class of models, so we have chosen as thebaseline the most widely used phrase-based SMTmodel. Our positive results suggest that our ex-periments could be tried on other current statisticalMT models, especially the growing family of tree-structured SMT models employing stochastic trans-duction grammars of various sorts (Wu and Chiang,2007). For instance, incorporating WSD predictionsinto an MT decoder based on inversion transductiongrammars (Wu, 1997)—such as the Bracketing ITGbased models of Wu (1996), Zens et al. (2004), orCherry and Lin (2007)—would present an intriguingcomparison with the present work. It would also beinteresting to assess whether a more grammaticallystructured statistical MT model that is less relianton an n-gram language model, such as the syntacticITG based “grammatical channel” translation modelof (Wu and Wong, 1998), could make more effectiveuse of WSD predictions.

References

Satanjeev Banerjee and Alon Lavie. METEOR:An automatic metric for MT evaluation with im-proved correlation with human judgement. InProceedings of Workshop on Intrinsic and Extrin-sic Evaluation Measures for MT and/or Summa-rization at the 43th Annual Meeting of the Associ-ation of Computational Linguistics (ACL-2005),Ann Arbor, Michigan, June 2005.

Peter Brown, Stephen Della Pietra, Vincent DellaPietra, and Robert Mercer. Word-sense disam-biguation using statistical methods. In Proceed-ings of 29th meeting of the Association for Com-putational Linguistics, pages 264–270, Berkeley,California, 1991.

Clara Cabezas and Philip Resnik. Using WSD tech-niques for lexical selection in statistical machinetranslation. Technical report, Institute for Ad-vanced Computer Studies, University of Mary-land, 2005.

Marine Carpuat and Dekai Wu. Evaluating the word

69

sense disambiguation performance of statisticalmachine translation. In Proceedings of the SecondInternational Joint Conference on Natural Lan-guage Processing (IJCNLP), pages 122–127, JejuIsland, Republic of Korea, 2005.

Marine Carpuat and Dekai Wu. Word sense disam-biguation vs. statistical machine translation. InProceedings of the annual meeting of the associa-tion for computational linguistics (ACL-05), AnnArbor, Michigan, 2005.

Marine Carpuat and Dekai Wu. How phrasesense disambiguation outperforms word sensedisambiguation for statistical machine translation.Forthcoming, 2007.

Marine Carpuat, Weifeng Su, and Dekai Wu. Aug-menting ensemble classification for word sensedisambiguation with a Kernel PCA model. In Pro-ceedings of Senseval-3, Third International Work-shop on Evaluating Word Sense DisambiguationSystems, Barcelona, July 2004. SIGLEX, Associ-ation for Computational Linguistics.

Marine Carpuat, Yihai Shen, Xiaofeng Yu, andDekai Wu. Toward integrating word sense and en-tity disambiguation into statistical machine trans-lation. In Third International Workshop on Spo-ken Language Translation (IWSLT 2006), Kyoto,November 2006.

Colin Cherry and Dekang Lin. Inversion Transduc-tion Grammar for joint phrasal translation mod-eling. In Dekai Wu and David Chiang, editors,NAACL-HLT 2007 / AMTA Workshop on Syntaxand Structure in Statistical Translation (SSST),pages 17–24, Rochester, NY, April 2007.

Timothy Chklovski, Rada Mihalcea, Ted Pedersen,and Amruta Purandare. The Senseval-3 multilin-gual English-Hindi lexical sample task. In Pro-ceedings of Senseval-3, Third International Work-shop on Evaluating Word Sense DisambiguationSystems, pages 5–8, Barcelona, Spain, July 2004.SIGLEX, Association for Computational Linguis-tics.

Ido Dagan and Alon Itai. Word sense disambigua-tion using a second language monolingual corpus.Computational Linguistics, 20(4):563–596, 1994.

Mona Diab. Relieving the data acquisition bottle-neck in word sense disambiguation. In Proceed-

ings of the 42nd Annual Meeting of the Associa-tion for Computational Linguistics, 2004.

George Doddington. Automatic evaluation ofmachine translation quality using n-gram co-occurrence statistics. In Proceedings of theHuman Language Technology conference (HLT-2002), San Diego, CA, 2002.

Zhendong Dong. Knowledge description: what,how and who? In Proceedings of Inter-national Symposium on Electronic Dictionary,Tokyo, Japan, 1998.

Yoram Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning andan application to boosting. In Journal of Com-puter and System Sciences, 55(1), pages 119–139,1997.

Ismael Garcia-Varea, Franz Och, Hermann Ney, andFrancisco Casacuberta. Refined lexicon modelsfor statistical machine translation using a maxi-mum entropy approach. In Proceedings of the39th annual meeting of the association for compu-tational linguistics (ACL-01), Toulouse, France,2001.

Ismael Garcia-Varea, Franz Och, Hermann Ney,and Francisco Casacuberta. Efficient integrationof maximum entropy lexicon models within thetraining of statistical alignment models. In Pro-ceedings of AMTA-2002, pages 54–63, Tiburon,California, October 2002.

E.T. Jaynes. Where do we Stand on Maximum En-tropy? MIT Press, Cambridge MA, 1978.

Adam Kilgarriff and Joseph Rosenzweig. Frame-work and results for English Senseval. Computersand the Humanities, 34(1):15–48, 1999. Specialissue on SENSEVAL.

Adam Kilgarriff. English lexical sample task de-scription. In Proceedings of Senseval-2, Sec-ond International Workshop on Evaluating WordSense Disambiguation Systems, pages 17–20,Toulouse, France, July 2001. SIGLEX, Associa-tion for Computational Linguistics.

Dan Klein and Christopher D. Manning. Conditionalstructure versus conditional estimation in NLPmodels. In Proceedings of EMNLP-2002, Confer-ence on Empirical Methods in Natural Language

70

Processing, pages 9–16, Philadelphia, July 2002.SIGDAT, Association for Computational Linguis-tics.

Philipp Koehn, Franz Och, and Daniel Marcu. Sta-tistical phrase-based translation. In Proceedingsof HLT/NAACL-2003, Edmonton, Canada, May2003.

Philipp Koehn. Pharaoh: a beam search decoder forphrase-based statistical machine translation mod-els. In 6th Conference of the Association for Ma-chine Translation in the Americas (AMTA), Wash-ington, DC, September 2004.

Gregor Leusch, Nicola Ueffing, and Hermann Ney.Efficient MT evaluation using block movements.In Proceedings of EACL-2006 (11th Confer-ence of the European Chapter of the Associationfor Computational Linguistics), pages 241–248,Trento, Italy, April 2006.

Cong Li and Hang Li. Word translation disambigua-tion using bilingual bootstrapping. In Proceed-ings of the 40th Annual Meeting of the Associ-ation for Computational Linguistics, pages 343–351, 2002.

Rada Mihalcea, Timothy Chklovski, and Adam Kill-gariff. The Senseval-3 English lexical sampletask. In Proceedings of Senseval-3, Third Interna-tional Workshop on Evaluating Word Sense Dis-ambiguation Systems, pages 25–28, Barcelona,Spain, July 2004. SIGLEX, Association for Com-putational Linguistics.

Hwee Tou Ng, Bin Wang, and Yee Seng Chan. Ex-ploiting parallel texts for word sense disambigua-tion: An empirical study. In Proceedings of ACL-03, Sapporo, Japan, pages 455–462, 2003.

Franz Josef Och and Hermann Ney. A system-atic comparison of various statistical alignmentmodels. Computational Linguistics, 29(1):19–52,2003.

Franz Josef Och. Minimum error rate training in sta-tistical machine translation. In Proceedings of the41st Annual Meeting of the Association for Com-putational Linguistics, pages 160–167, 2003.

Kishore Papineni, Salim Roukos, Todd Ward, andWei-Jing Zhu. BLEU: a method for automaticevaluation of machine translation. In Proceedings

of the 40th Annual Meeting of the Association forComputational Linguistics, 2002.

Michael Paul. Overview of the IWSLT06 evalua-tion campaign. In Third International Workshopon Spoken Language Translation (IWSLT 2006),Kyoto, November 2006.

Philip Resnik and David Yarowsky. Distinguisingsystems and distinguishing senses: New evalua-tion methods for word sense disambiguation. Nat-ural Language Engineering, 5(2):113–133, 1999.

Bernhard Scholkopf, Alexander Smola, and Klaus-Rober Muller. Nonlinear component analysis as akernel eigenvalue problem. Neural Computation,10(5), 1998.

Matthew Snover, Bonnie Dorr, Richard Schwartz,Linnea Micciulla, and John Makhoul. A studyof translation edit rate with targeted human an-notation. In Proceedings of AMTA, pages 223–231, Boston, MA, 2006. Association for MachineTranslation in the Americas.

Lucia Specia, Maria das Gracas Volpe Nunes,Gabriela Castelo Branco Ribeiro, and MarkStevenson. Multilingual versus monolingualWSD. In EACL-2006 Workshop on MakingSense of Sense: Bringing Psycholinguistics andComputational Linguistics Together, pages 33–40, Trento, Italy, April 2006.

Lucia Specia. A hybrid relational approach forWSD—first results. In Proceedings of the COL-ING/ACL 06 Student Research Workshop, pages55–60, Sydney, July 2006. ACL.

Andreas Stolcke. SRILM—an extensible languagemodeling toolkit. In International Conference onSpoken Language Processing, Denver, Colorado,September 2002.

Christoph Tillmann, Stefan Vogel, Hermann Ney,A. Zubiaga, and H. Sawaf. Accelerated DP-based search for statistical translation. In Pro-ceedings of Eurospeech’97, pages 2667–2670,Rhodes, Greece, 1997.

David Vickrey, Luke Biewald, Marc Teyssier, andDaphne Koller. Word-sense disambiguation formachine translation. In Joint Human LanguageTechnology conference and Conference on Em-pirical Methods in Natural Language Processing(HLT/EMNLP 2005), Vancouver, 2005.

71

Dekai Wu and David Chiang, editors. NAACL-HLT2007 / AMTA Workshop on Syntax and Structurein Statistical Translation (SSST). Association forComputational Linguistics, Rochester, NY, USA,April 2007.

Dekai Wu and Hongsing Wong. Machine translationwith a stochastic grammatical channel. In Pro-ceedings of COLING-ACL’98, Montreal,Canada,August 1998.

Dekai Wu, Weifeng Su, and Marine Carpuat. AKernel PCA method for superior word sense dis-ambiguation. In Proceedings of the 42nd An-nual Meeting of the Association for Computa-tional Linguistics, Barcelona, Spain, July 2004.

Dekai Wu. A polynomial-time algorithm for statis-tical machine translation. In Proceedings of 34thAnnual Meeting of the Association for Computa-tional Linguistics, Santa Cruz, California, June1996.

Dekai Wu. Stochastic inversion transduction gram-mars and bilingual parsing of parallel corpora.Computational Linguistics, 23(3):377–404, 1997.

David Yarowsky and Radu Florian. Evaluating sensedisambiguation across diverse parameter spaces.Natural Language Engineering, 8(4):293–310,2002.

Richard Zens, Hermann Ney, Taro Watanabe, andEiichiro Sumita. Reordering constraints forphrase-based statistical machine translation. In20th International Conference on ComputationalLinguistics (COLING-2004), Geneva, August2004.

72

Improving Statistical Machine Translation Using Word Sense...

Documents

Transcript of Improving Statistical Machine Translation Using Word Sense...