
Research Article
A Systematic Comparison of Data Selection Criteria for SMT Domain Adaptation

Longyue Wang, Derek F. Wong, Lidia S. Chao, Yi Lu, and Junwen Xing

Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory, Department of Computer and Information Science, University of Macau, Macau, China

Correspondence should be addressed to Longyue Wang; vincentwang0229@hotmail.com

Received 30 August 2013; Accepted 4 December 2013; Published 11 February 2014

Academic Editors: J. Shu and F. Yu

Copyright © 2014 Longyue Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Data selection has shown significant improvements in the effective use of training data by extracting sentences from large general-domain corpora to adapt statistical machine translation (SMT) systems to in-domain data. This paper performs an in-depth analysis of three different sentence selection techniques. The first one is cosine tf-idf, which comes from the realm of information retrieval (IR). The second is the perplexity-based approach, which can be found in the field of language modeling. These two data selection techniques applied to SMT have already been presented in the literature. However, edit distance for this task is proposed in this paper for the first time. After investigating the individual models, a combination of all three techniques is proposed, at both the corpus level and the model level. Comparative experiments are conducted on a Hong Kong law Chinese-English corpus, and the results indicate the following: (i) the constraint degree of the similarity measure is not monotonically related to domain-specific translation quality; (ii) the individual selection models fail to perform effectively and robustly; but (iii) bilingual resources and combination methods are helpful to balance out-of-vocabulary (OOV) words and irrelevant data; and (iv) finally, our method achieves the goal of consistently boosting the overall translation performance, which can ensure the optimal quality of a real-life SMT system.

1. Introduction

The performance of an SMT [1] system depends heavily upon the quantity of training data as well as the domain-specificity of the test data with respect to the training data. A well-known challenge is that a data-driven system is not guaranteed to perform optimally if the data for training and testing are not identically distributed. Thus, domain adaptation has been a promising direction for boosting domain-specific translation from a model trained on a mixture of in-domain and out-of-domain data.

One of the dominant approaches is to select suitable data for the target domain from a large general-domain corpus (general corpus), under the assumption that the general corpus is broad enough to cover some percentage of sentences that fall in the target domain. A domain-adapted machine translation system can then be obtained by training on the selected subset (Axelrod et al. [2] defined it as a pseudo in-domain subcorpus, a term also adopted in this paper) instead of the entire data. Our work mainly focuses on these supplementary data selection approaches, which have shown significant improvements in the construction of domain-adapted SMT models. Models trained on such selected data benefit in three ways: first, the quality of the word alignments is improved; secondly, the extraction of a large number of irrelevant phrase pairs can be prevented; and thirdly, the estimation of reordering factors for target sentences can also be fairly optimized.

Data selection is often used for language model (LM) and translation model (TM) optimization. It generally comprises three processing stages in order to translate the domain-specific data Q using a general-domain parallel corpus G and a general-domain monolingual corpus M in the target language.

(1) Scoring. The relevance of each sentence pair ⟨S_i, T_i⟩ in G to the target domain can be estimated by various similarity metrics, which can be uniformly stated as follows:

Score(S_i, T_i) → Sim(V_i, R), (1)



where the sentences of the source side S_i, of the target side T_i, or of both sides can be considered for similarity measuring; thus we uniformly denote them as V_i. R is an abstract model representing the target domain. The various instantiations of (1) will be detailed in Section 3.

(2) Resampling. Each scored sentence pair that is given a high weight value is kept; otherwise it is removed from the pseudo in-domain subcorpus, according to a binary filter function:

Filter(Score_i) = 1 if Score_i > θ, and 0 otherwise, (2)

in which Score_i is the similarity value of the i-th sentence pair ⟨S_i, T_i⟩ according to (1), and θ is a tunable threshold. The filter gives 1 for ⟨S_i, T_i⟩ with a score higher than θ and zero otherwise.

Note that the first two steps can also be applied to the selection of M for the construction of the LM. The only difference is that (1) and (2) then consider only a sentence L_i of M in a monolingual setting.
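To make the scoring and resampling stages concrete, the following sketch (with hypothetical helper names; any sentence-level similarity function can be plugged in as sim) keeps exactly those pairs of G whose score exceeds the threshold θ of (2):

```python
from typing import Callable, List, Tuple

SentencePair = Tuple[str, str]  # (source sentence S_i, target sentence T_i)

def resample(general_corpus: List[SentencePair],
             sim: Callable[[SentencePair], float],
             theta: float) -> List[SentencePair]:
    """Apply the binary filter of (2) on top of the scoring function of (1):
    keep a pair when its similarity to the target domain exceeds theta."""
    pseudo_in_domain = []
    for pair in general_corpus:
        score = sim(pair)              # Score(S_i, T_i) -> Sim(V_i, R)
        if score > theta:              # Filter(Score_i) = 1
            pseudo_in_domain.append(pair)
    return pseudo_in_domain
```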

(3) Translation. After collecting a certain amount of ⟨S_i, T_i⟩ or ⟨L_i⟩, the TM and LM are trained on these pseudo in-domain subsets. A phrase-based SMT model, shown in (3), can be employed to obtain the best translation:

T_best = argmax_T Pr(T | S) = argmax_T Σ_{m=1}^{M} λ_m h_m(T, S), (3)

where h_m(T, S) represents a feature function and λ_m is the weight assigned to the corresponding feature function. In general, the SMT system uses a total of eight features: an n-gram LM, two phrase translation probabilities, two lexical translation probabilities, a word penalty, a phrase penalty, and a linear reordering penalty [3–5].
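As an illustration of (3), the sketch below scores candidate translations under a log-linear model; the feature names and candidate structure are placeholders for illustration, not the exact Moses feature set:

```python
from typing import Dict, List

def loglinear_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of feature values, i.e. sum_m lambda_m * h_m(T, S) in (3)."""
    return sum(weights[name] * value for name, value in features.items())

def best_translation(candidates: List[Dict], weights: Dict[str, float]) -> Dict:
    """T_best = argmax over candidate translations of the log-linear score."""
    return max(candidates, key=lambda c: loglinear_score(c["features"], weights))
```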

As similarity measuring has a great impact on translation quality, our goal is to find which kind of data selection model benefits domain-specific translation effectively and robustly. Therefore, we systematically investigated and compared three state-of-the-art data selection criteria. The first two are the cosine tf-idf and perplexity-based criteria, techniques from information retrieval and language modeling applied to SMT. The third one is the edit-distance-based method, which is explored for the first time for this particular task.

The results show that each individual criterion has the following natural pros and cons.

(i) Cosine tf-idf regards text as a bag of words and tries to retrieve similar sentences according to word overlap. Although it is helpful for reducing the number of out-of-vocabulary (OOV) words, this simple cooccurrence-based matching results in a weak ability to filter out irrelevant data (noise).

(ii) Perplexity-based similarity metrics employ an n-gram LM, which considers not only the distribution of terms but also their collocation (n-gram word order). It works well in balancing the OOVs and noise by considering more factors, but its performance is sensitive to the quality of the in-domain LM as well as the quantity of the pseudo in-domain subcorpus.

Figure 1: Data selection criteria pyramid. Sim(V, R) is measured between a query sentence V and a candidate sentence R; from bottom to top, the levels are word overlap (cosine tf-idf), word order (perplexity-based), and word position (edit distance).

(iii) The edit-distance-based method is much stricter than the former two criteria, because the factors of word overlap, order, and position are all comprehensively considered for similarity measuring. This seems able to find the most similar sentences, but it finally fails to outperform the general baseline in our experiments.

The more factors are considered, the higher the constraint degree of the similarity criterion. This can be depicted as a pyramid: Figure 1 shows the comparative depths of the different criteria, with edit distance at the peak, followed by perplexity-based, and then cosine tf-idf. Each method covers all the factors underneath it; for instance, perplexity-based methods consider both word overlap and word order in similarity measuring. By considering more factors, a criterion at a higher level is stricter than the ones below it.

Although most SMT systems optimized by the presented methods can outperform the baseline system trained on the general corpus (general baseline), their improvements are still either unclear or unstable (see Section 2). It is hard for these individual models to be applied in a real-life system; thus a combination approach is proposed as a trade-off. We combine all the presented criteria (cosine tf-idf, perplexity-based, and edit-distance-based) by performing linear interpolation at two levels: (i) corpus level, where we join the pseudo in-domain subcorpora selected by different similarity metrics at the resampling stage, and (ii) model level, where multiple models trained on pseudo in-domain subsets retrieved via different similarity metrics are combined at the translation stage. Since TM adaptation and LM adaptation may benefit each other [6], we combine these two adaptation approaches.

Comparative experiments were conducted using a large Chinese-English general corpus, where SMT systems are trained and adapted to translate the in-domain sentences of the Hong Kong law statements. We measure the influence of different data selection methods on the final translation output. Using BLEU [7] as the evaluation metric, the results indicate that our proposed approach successfully obtains a robust performance that is better than any single individual model as well as the baseline system. Although the edit-distance-based technique is the strictest of the presented criteria and does not outperform the others in the individual setting, the data selected by this technique can complement the data retrieved by the other techniques, as demonstrated in the combined scenario.

If a small in-domain corpus is available, it is worthwhile to use it both to select pseudo in-domain data from the general corpus and to improve the current translation systems via combination methods. Thus, we further combine the adapted system trained on pseudo in-domain data with the one trained on a small in-domain corpus using the multiple-paths decoding technique (as discussed in Section 2). Finally, we show that combining this small in-domain TM can further improve the best domain-adapted SMT system by up to 1.21 BLEU points, and the resulting system is also better than the general + in-domain system by at most 3.31 points.

The remainder of this paper is organized as follows. We first review the related work in Section 2. The proposed and other related similarity models are described in Section 3. Section 4 details the configuration of the experiments. Finally, we compare and discuss the results in Section 5, followed by the conclusions that end the paper.

2. Related Work

Researchers have discussed the domain adaptation problem for SMT from various perspectives, such as mining unknown words from comparable corpora [8], weighted phrase extraction [9], corpus weighting [10], and mixing multiple models [11–13]. Data selection is actually one of the corpus weighting methods (alongside data weighting and translation model adaptation) [14]. In this section we revisit the state-of-the-art selection criteria and our combination techniques.

2.1. Data Selection Criteria. The first selection criterion is the cosine tf-idf (term frequency-inverse document frequency) similarity, which comes from the realm of information retrieval (IR). Hildebrand et al. [15] applied this IR technique to select fewer but more similar sentences for the adaptation of the TM and LM. They concluded that this method can be adapted to improve translation performance, especially for LM adaptation. Similar to the experiments described in this paper, Lu et al. [6] proposed resampling and reweighting methods for online and offline TM optimization, which are closer to a real-life SMT system. Furthermore, their results indicated that duplicated sentences can affect the translations. They obtained about 1 BLEU point of improvement using 60% of the total data. In this study, we still consider using duplicated sentences as a hidden factor for all the presented methods.

The second category is the perplexity-based approach, which can be found in the field of language modeling. It has been adopted by Lin et al. [16] and Gao et al. [17], in which perplexity is used to score text segments according to an in-domain LM. More recently, Moore and Lewis [18] derived the cross-entropy difference metric from a simple variant of Bayes' rule. However, theirs was a preliminary study that did not yet show an improvement on an MT task. The method was further developed by Axelrod et al. [2] for SMT adaptation. They also presented a novel bilingual method and compared it with other variants. Their experimental results show that this fast and simple technique allows over 99% of the general corpus to be discarded while yielding an increase of 1.8 BLEU points. However, they tested the various perplexity-based methods only with a limited range of threshold settings in (2), which may not reflect their overall performance.

In addition, previous work often discussed the related methods separately for the TM [2] or the LM [15]. But Lu et al. [6] pointed out that combining LM and TM adaptation approaches could further improve performance. In this paper, we optimize both the TM and the LM via data selection methods.

The third retrieval model has not been explicitly used for SMT but is still applicable to our scenario. Edit distance (ED), also known as Levenshtein distance (LD) [19], is a widely used similarity measure for example-based MT (EBMT). Koehn and Senellart [20] applied this method for the convergence of translation memory (TM) and SMT. Leveling et al. [21] investigated different approximate sentence retrieval approaches (e.g., LD and standard IR) for EBMT. Both papers gave the exact formula of fuzzy matching. This inspires us to regard the metric as a new data selection criterion for the SMT domain adaptation task. Good performance can be expected under the assumption that the general corpus is big enough to cover sentences very similar to the test data.

2.2. Combination Methods. The existing domain adaptation methods can be summarized into two broad categories: (i) corpus level, by selecting, joining, or weighting the datasets upon which the models are trained, and (ii) model level, by combining multiple models together in a weighted manner [2]. In corpus-level combination, sentence frequency is often used as a weighting scheme. Hildebrand et al. [15] allowed duplicated sentences in the selected dataset, which is a hidden bias toward in-domain data. Furthermore, Lu et al. [6] gave each relevant sentence a higher integer weight in the GIZA++ file. Mixture modeling is a standard technique in machine learning [22]. Foster and Kuhn [12] interpolated multiple models together by applying linear and log-linear weights to the entries of phrase tables. Interpolation is also often used in pivot-based SMT to combine standard and pivot models [23]. In order to investigate the best combination of these data selection criteria, we compare different combination methods at both the corpus level and the model level.

Furthermore, additional in-domain corpora can also be used to enhance the current TMs for domain adaptation. Linear interpolation (similar to combination at the model level) could be applied to combine the in-domain TM with the domain-adapted TM. However, directly concatenating the phrase tables into one may lead to unpredictable behavior during decoding. In addition, Axelrod et al. [2] concluded that it is more effective to use the multidecoding method [24] than linear interpolation. Thus, we prefer to employ the combination of multiple phrase tables at decoding time for this task.

3. Model Description

This section describes four data selection models: those respectively based on cosine tf-idf, perplexity, and edit distance, as well as the proposed combination approach.

3.1. Cosine tf-idf. Each document D_i is represented as a vector (w_{i1}, w_{i2}, ..., w_{in}), where n is the size of the vocabulary. The weight w_{ij} is calculated as follows:

w_{ij} = tf_{ij} × log(idf_j), (4)

in which tf_{ij} is the term frequency (TF) of the j-th word of the vocabulary in the document D_i, and idf_j is the inverse document frequency (IDF) of the j-th word. The similarity between two texts is then defined as the cosine of the angle between their two vectors. We implemented this with Apache Lucene (available at http://lucene.apache.org), and it is very similar to the experiments described in [6, 15]. Suppose that M is the size of the query set and N is the number of sentences retrieved from the general corpus for each query. Then the size of the cosine tf-idf based pseudo in-domain subcorpus is Size_Cos-IR = M × N.

3.2. Perplexity. Perplexity is based on the cross-entropy (5), which is the average of the negative logarithm of the word probabilities. Consider

H(p, q) = −Σ_{i=1}^{n} p(w_i) log q(w_i) = −(1/N) Σ_{i=1}^{n} log q(w_i), (5)

where p denotes the empirical distribution of the test sample: p(w_i) = n/N if w_i appears n times in the test sample of size N, and q(w_i) is the probability of the event w_i estimated from the training set. Thus the perplexity pp can be simply expressed as

pp = b^{H(p, q)}, (6)

where b is the base with respect to which the cross-entropy is measured (e.g., bits or nats) and H(p, q) is the cross-entropy given in (5), which is often applied as a cosmetic substitute for perplexity in data selection [2, 18].

Let H_I(p, q) and H_O(p, q) be the cross-entropies of a string w_i according to language models LM_I and LM_O, which are respectively trained on the in-domain dataset I and the general-domain dataset G. Considering the source (src) and target (tgt) sides of the training data, there are three perplexity-based variants. The first is the basic cross-entropy, given by

H_{I-src}(p, q). (7)

The second one is the Moore-Lewis cross-entropy difference [18],

H_{I-src}(p, q) − H_{G-src}(p, q), (8)

which tries to select the sentences that are more similar to I but different from the rest of G. The above two criteria only consider the sentences in the source language. Furthermore, Axelrod et al. [2] proposed a metric that sums the cross-entropy difference over both sides:

[H_{I-src}(p, q) − H_{G-src}(p, q)] + [H_{I-tgt}(p, q) − H_{G-tgt}(p, q)]. (9)

The candidates with lower scores (obtained by (7), (8), and (9)) have higher relevance to the target domain. The size of the perplexity-based pseudo in-domain subset, Size_PP, should be equal to Size_Cos-IR. In practice, we use the SRILM toolkit (available at http://www.speech.sri.com/projects/srilm) [25] to build 5-gram LMs with interpolated modified Kneser-Ney discounting [26].
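The following sketch illustrates selection by the bilingual cross-entropy difference of (9). It assumes four per-sentence cross-entropy functions are supplied (in-domain/general × source/target), for example computed with the SRILM-trained 5-gram LMs described above; how those functions are obtained is left out.

```python
from typing import Callable, List, Tuple

CrossEntropy = Callable[[str], float]   # H(p, q) of one sentence under a given LM

def bilingual_ced(src: str, tgt: str,
                  h_i_src: CrossEntropy, h_g_src: CrossEntropy,
                  h_i_tgt: CrossEntropy, h_g_tgt: CrossEntropy) -> float:
    """Bilingual cross-entropy difference of (9); lower means more in-domain."""
    return (h_i_src(src) - h_g_src(src)) + (h_i_tgt(tgt) - h_g_tgt(tgt))

def select_by_ced(pairs: List[Tuple[str, str]], size_pp: int,
                  h_i_src: CrossEntropy, h_g_src: CrossEntropy,
                  h_i_tgt: CrossEntropy, h_g_tgt: CrossEntropy) -> List[Tuple[str, str]]:
    """Rank the general corpus by (9) and keep the size_pp lowest-scoring pairs,
    so that Size_PP matches Size_Cos-IR as required above."""
    ranked = sorted(pairs, key=lambda p: bilingual_ced(p[0], p[1],
                                                       h_i_src, h_g_src,
                                                       h_i_tgt, h_g_tgt))
    return ranked[:size_pp]
```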

3.3. Edit Distance. Given a sentence s_G from G and a sentence s_R from the test set or in-domain corpus, the edit distance for these two sequences is defined as the minimum number of edits, that is, symbol insertions, deletions, and substitutions, needed to transform s_G into s_R. Based on Levenshtein distance, or edit distance, there are several different implementations. We used the normalized Levenshtein similarity score (fuzzy matching score, FMS):

FMS = 1 − LD(s_G, s_R) / max(|s_G|, |s_R|), (10)

where |s_G| and |s_R| are the lengths of s_G and s_R [20, 21]. In this study, we employed the word-based Levenshtein edit distance function. If a sentence's score exceeds a threshold, we further penalize it according to the whitespace and punctuation editing differences. The algorithm is implemented in MapReduce to parallelize the process and shorten the processing time.

3.4. Proposed Model. For corpus-level combination, we respectively run the GIZA++ toolkit (http://www.statmt.org/moses/giza/GIZA++.html) on the pseudo in-domain subcorpora selected by the different data selection methods. Then each sentence in the different subcorpora can be weighted by modifying its corresponding number of occurrences in the GIZA++ file [6]. Finally, we combine these GIZA++ files together into a new one. Formally, this combination method can be defined as follows:

α CosIR(S_x, T_x) ∪ β PPBased(S_y, T_y) ∪ λ EDBased(S_z, T_z), (11)

where α, β, and λ are the weights for the sentence pairs (S_x, T_x), (S_y, T_y), and (S_z, T_z), which are selected by the cosine tf-idf (CosIR), perplexity-based (PPBased), and edit-distance-based (EDBased) methods, respectively. Note that the number of sentence occurrences should be an integer. In practice, we made α, β, and λ integers and multiplied these weights with the number of occurrences of each corresponding sentence in the GIZA++ file.

For model-level combination, we perform linear interpolation on the models trained with the subcorpora retrieved by the different data selection methods. The phrase translation probability φ(t | s) and the lexical weight p_w(t | s, a) are estimated using (12) and (13), respectively, as follows:

φ(t | s) = Σ_i α_i φ_i(t | s), (12)

p_w(t | s, a) = Σ_i β_i p_{w,i}(t | s, a), (13)

where i = 1, 2, 3 denotes the phrase translation probabilities and lexical weights trained on the subcorpora retrieved by the CosIR, PPBased, and EDBased models; s and t are the phrases in the source and target languages; a is the alignment information; and α_i and β_i are the tunable interpolation parameters, subject to Σα_i = Σβ_i = 1.

In this paper we are more interested in the comparison of the various data selection criteria than in the best combination method; thus we did not tune the weights in the above equations and gave them equal values (at the corpus level, α = β = λ = 1; at the model level, α_i = β_i = 1/3) in the experiments. Moreover, we will see that even these simple combinations can lead to an ideal improvement of translation quality.
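A sketch of the model-level interpolation of (12): the phrase translation probabilities of the three component tables are mixed with equal weights α_i = 1/3 (the lexical weights of (13) would be interpolated the same way). Treating an entry missing from a table as probability 0 is an assumption of this sketch, not necessarily the paper's convention:

```python
from typing import Dict, List, Tuple

PhraseTable = Dict[Tuple[str, str], float]   # (source phrase, target phrase) -> phi(t | s)

def interpolate_phrase_tables(tables: List[PhraseTable],
                              alphas: List[float]) -> PhraseTable:
    """phi(t | s) = sum_i alpha_i * phi_i(t | s), over entries seen in any table."""
    assert abs(sum(alphas) - 1.0) < 1e-9, "interpolation weights must sum to 1"
    all_entries = set().union(*tables)
    return {entry: sum(a * table.get(entry, 0.0) for a, table in zip(alphas, tables))
            for entry in all_entries}

# Equal weights for the Cos-IR, PP-Based, and ED-Based tables, as in the experiments:
# merged = interpolate_phrase_tables([cos_table, pp_table, ed_table], [1/3, 1/3, 1/3])
```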

4. Experimental Setup

4.1. Corpora. Two corpora are needed for the domain adaptation task. The general corpus includes more than 1 million parallel sentences comprising a variety of genres, such as newswire (LDC2005T10), translation examples from dictionaries, law statements, and sentences from online sources. The domain distribution of the general corpus is shown in Table 1. The miscellaneous part includes sentences crawled from various materials, and the law portion includes articles collected from mainland China, Hong Kong, and Macau, which have different laws. The in-domain corpus, development set, and test set are randomly selected (and are disjoint) from the Hong Kong law corpus (LDC2004T08). The sizes of the test set, in-domain corpus, and general corpus we used are summarized in Table 2.

For preprocessing, the English texts are tokenized by the Moses scripts (available at http://www.statmt.org/europarl) [27], and the Chinese texts are segmented by NLPIR (available at http://ictclas.nlpir.org) [28]. We removed sentences longer than 80 words. Furthermore, we used only the target side of the general corpus as the general-domain LM training corpus. Thus, the target side of the pseudo in-domain corpus obtained by the methods described in Section 3 can be directly used for training the adapted LM.

Table 1: Proportions of domains in the general corpus.

Domain          Sentences    %
News            279,962      24.60
Novel           304,932      26.79
Law             48,754       4.28
Miscellaneous   504,396      44.33
Total           1,138,044    100.00

4.2. Systems Description. The experiments presented in this paper were carried out with the Moses toolkit [29], a state-of-the-art open-source phrase-based SMT system. The translation and reordering models relied on "grow-diag-final" symmetrized word-to-word alignments built using GIZA++ [30] and the training scripts of Moses. The weights of the log-linear model were optimized by means of MERT [31].

A 5-gram language model was trained using the IRSTLM toolkit [32], exploiting improved modified Kneser-Ney smoothing and quantizing both probabilities and back-off weights.

4.3. Settings. For the comparison, a total of five existing representative data selection models, three baseline systems, and the proposed model were selected. The corresponding settings of these models are as follows.

(i) Baseline: the baseline systems were trained with the toolkits and settings described in Section 4.2. The in-domain baseline (IC-baseline) and general-domain baseline (GC-baseline) were respectively trained on the in-domain corpus and the general corpus. Then a combined baseline system (GI-baseline) was created by passing the above two phrase tables to the decoder and using them in parallel.

(ii) Individual models: as described in Sections 3.1, 3.2, and 3.3, the individual models are cosine tf-idf (Cos-IR) and the fuzzy matching scorer (FMS), which is an edit-distance-based (ED-Based) instance, as well as three perplexity-based (PP-Based) variants: cross-entropy (CE), cross-entropy difference (CED), and bilingual cross-entropy difference (B-CED).

(iii) Proposed model: as described in Section 3.4, we combined the Cos-IR, PP-Based, and ED-Based methods at the corpus level (named iTPB-C) and at the model level (named iTPB-M).

The test set, development set [6, 15], and in-domain corpus [2, 18] can be used to select data from the general-domain corpus. The first strategy is under the assumption that the input data is known before building the models. But in a real case we do not know the test data in advance. Therefore, we use the development set (dev strategy) and the additional in-domain corpus (in-domain strategy), which are identical in domain to the target domain, to select data for TM and LM adaptation.


Table 2: Corpora statistics.

Data set       Lang    Sentences    Tokens        Av. len
Test set       EN      2,050        60,399        29.46
Test set       ZH      2,050        59,628        29.09
Dev set        EN      2,000        59,732        29.26
Dev set        ZH      2,000        59,064        29.07
In-domain      EN      43,621       1,330,464     29.16
In-domain      ZH      43,621       1,321,655     28.97
Training set   EN      1,138,044    28,626,367    25.15
Training set   ZH      1,138,044    28,239,747    24.81

5. Results and Discussions

For each method we used the N-best selection results, where N = 20K, 40K, 80K, 160K, 320K, and 640K sentence pairs out of the 1.1M in the general corpus (roughly 1.75%, 3.5%, 7%, 14%, 28%, and 56% of the general-domain corpus, where K stands for thousand and M for million). In this section, we first discuss the three PP-Based variants and then compare the best one with the other two models, Cos-IR and FMS. After that, the combined model and the best individual model are compared. Finally, the performance is further improved by a simple but effective system combination method.

5.1. Baselines. The baseline results (Table 3) show that a translation system trained on the general corpus outperforms the one trained on the in-domain corpus by over 2.85 BLEU points. The main reason is that the general corpus is so broad that there are far fewer OOVs. Combining these two systems further increases the GC-baseline by 1.91 points.

5.2. Individual Models. The curves in Figure 2 show that all three PP-Based variants, that is, CE, CED, and B-CED, can be used to train domain-adapted SMT systems. Using less training data, they still obtain better performance than the GC-baseline. CE is the simplest PP-Based variant and improves over the GC-baseline by at most 1.64 points when selecting 320K out of 1.1M sentences. Although CED gives a bias to the sentences that are more similar to the target domain, its performance seems unstable with the dev strategy and is the worst with the in-domain strategy. This indicates that it is not suitable to randomly select data from the general corpus for training LM_O (see Section 3.2). Similar to the conclusion in [2], B-CED achieves the highest improvement with the least selected data among the three variants. This proves that bilingual resources are helpful to balance OOVs and noise. However, when using the in-domain corpus to select the pseudo in-domain subcorpora, the trends are similar but worse: the methods have to enlarge the size of the selected data to perform nearly as well as under the dev strategy. Therefore, PP-Based approaches may not work well for a real-life SMT application.

Next, we use B-CED (the best of the PP-Based variants) to compare with the other selection criteria. As shown in Figure 3, Cos-IR improves by at most 1.02 (dev) and 0.88 (in-domain) BLEU points using 28% of the general corpus. The results approximately match the conclusions given

Table 3: BLEU via general-domain and in-domain corpus.

Baseline       Sentences    BLEU
GC-baseline    1.1M         39.15
IC-baseline    45K          36.30
GI-baseline    -            41.06

Table 4: Translation results of iTPB at different combination levels.

Methods    Sentences    BLEU (dev set)    BLEU (in-domain set)
B-CED      80K          40.91             35.50
B-CED      160K         41.12             39.47
B-CED      320K         40.02             40.98
iTPB-C     80K          42.25             39.39
iTPB-C     160K         43.04             41.87
iTPB-C     320K         42.42             40.44
iTPB-M     80K          42.93             40.57
iTPB-M     160K         43.65             41.95
iTPB-M     320K         43.97             42.21

by [6, 15], showing that keyword overlap (Figure 1) plays a significant role in retrieving sentences from similar domains. However, it still needs a large amount of selected data (more than 28%) to obtain an ideal performance. The main reason may be that sentences containing the same keywords may still be irrelevant. For example, two sentences may share the same phrase "according to the article", but one sentence is from the legal domain and the other is from news. Figure 3 also shows that FMS fails to outperform the GC-baseline system even though it is much stricter than the other criteria. By adding the word position factor into similarity measuring, FMS tries to find highly similar sentences in terms of length, collocation, and even semantics. But it seems that our general corpus is not large enough to cover a sufficient amount of FMS-similar sentences. As the size of the general or in-domain corpus increases, FMS may benefit the translation quality, because it still works better than the IC-baseline, which proves its positive impact on filtering noise.

Among the three presented criteria, PP-Based can achieve the highest BLEU by considering an appropriate number of factors for similarity measuring. However, the curves show that it depends heavily upon the threshold θ in (2).


Figure 2: BLEU scores via perplexity-based data selection methods (CE, CED, B-CED, and GC-base) with dev (a) and in-domain (b) strategies; the x-axis is the number of selected sentences (K) and the y-axis is BLEU.

Figure 3: BLEU scores via different data selection methods (Cos-IR, B-CED, FMS, and GC-base) with dev (a) and in-domain (b) strategies; the x-axis is the number of selected sentences (K) and the y-axis is BLEU.

Selecting more or less pseudo in-domain data than this causes the performance to drop sharply. In contrast, Cos-IR works steadily and robustly with either strategy, but its improvements are not clear. Thus, no single individual model performs well in terms of both effectiveness and robustness.

5.3. Combined Model. From Figure 3 we found that each individual model peaks between 80K and 320K. Thus we only selected the top N = 80K, 160K, 320K sentence pairs for further comparison. We combined Cos-IR, FMS, and B-CED and assigned equal weights to each individual model at both the corpus and model levels (as described in Section 3.4). The translation qualities via iTPB are shown in Table 4.

At both levels, iTPB performs much better than any single individual model as well as the GC-baseline system. For instance, iTPB-C achieves improvements of at most 3.89 (dev) and 2.72 (in-domain) points over the baseline system. The result is also higher than that of the best individual model (B-CED) by 1.92 (dev) and 0.91 (in-domain) points. This shows a strong ability to balance OOVs and noise. On the one hand, filtering too many unmatched words may not sufficiently address the data sparsity issue of the SMT model; on the other hand, adding too much of the selected data may lead to the dilution of the in-domain characteristics of the SMT model. However, the combinations seem to inherit the pros and reduce the cons of the individual models. In addition, the performance of iTPB does not drop sharply when changing the threshold in (2)


Table 5: Results of combination models.

Methods        Sentences    BLEU (dev set)    BLEU (in-domain set)
GI-baseline    -            41.06             -
IC-baseline    45K          36.30             -
B-CED+I        80K          41.05             35.80
B-CED+I        160K         41.14             39.54
B-CED+I        320K         41.67             40.67
iTPB-C+I       80K          42.85             41.45
iTPB-C+I       160K         43.48             41.88
iTPB-C+I       320K         43.63             41.98
iTPB-M+I       80K          43.33             41.83
iTPB-M+I       160K         44.13             42.43
iTPB-M+I       320K         44.37             42.53

or the strategies. This shows that the combined models are both stable and robust enough to train a state-of-the-art SMT system. We also show that model-level combination works better (by up to +1 BLEU) than corpus-level combination with the same settings.

5.4. System Combination. In order to take advantage of all available data, we also combine the in-domain TM with the pseudo in-domain TM to further improve the translation quality. We used the multiple-paths decoding technique as described in Section 2.2.

Table 5 shows that the translation systems trained on a pseudo in-domain subset selected with individual or combined models can be further improved by combining them with an in-domain TM. The iTPB-M+I system, comprising two phrase tables trained on the iTPB-M based pseudo in-domain subset and the in-domain corpus, still works best. This combined system is at most 3.31 (dev) and 1.47 (in-domain) points better than the GI-baseline system, and 8.07 (dev) and 6.23 (in-domain) points better than the in-domain system alone.

6. Conclusions

In this paper, we analyze the impact of different data selection criteria on SMT domain adaptation. This is the first systematic comparison of state-of-the-art data selection methods such as cosine tf-idf and perplexity. In addition to the revisited methods, we consider edit distance as a new similarity metric for this task. The in-depth analysis can be very valuable to other researchers working on data selection.

Based on this investigation, a combination approach that makes use of the individual models is proposed and extensively evaluated. The combination is conducted not only at the corpus level but also at the model level, where domain-adapted systems trained via different methods are interpolated to facilitate the translation. Five individual methods (Cos-IR, CE, CED, B-CED, and FMS), three baseline systems (GC-baseline, IC-baseline, and GI-baseline), as well as the proposed method are evaluated on a large general corpus. Finally, we further combined the best domain-adapted system with the TM trained on the in-domain corpus to maximize the translation quality. Empirical results reveal that the proposed model achieves a good performance in terms of robustness and effectiveness.

We analyze the results from three different aspects:

(i) Translation quality: the results show a significant improvement for most methods, especially for iTPB. Under the current size of the datasets, considering more factors in similarity measuring may not benefit the translation quality.

(ii) Noise and OOVs: it is a big challenge for a single individual data selection model to balance them. However, bilingual resources and combination methods are helpful in dealing with this problem.

(iii) Robustness and effectiveness: a real-life system should achieve robust and effective performance with the in-domain strategy. Only iTPB obtained a consistently boosting performance.

Finally, we can draw a composite conclusion (a > b means that a is better than b):

iTPB > PPBased > Cos-IR > Baseline > FMS. (14)

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors are grateful to the Science and Technology Development Fund of Macau and the Research Committee of the University of Macau for the funding support of their research under Reference nos. MYRG076(Y1-L2)-FST13-WF and MYRG070(Y1-L2)-FST12-CS.


References

[1] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer, "The mathematics of statistical machine translation: parameter estimation," Computational Linguistics, vol. 19, no. 2, pp. 263–311, 1993.

[2] A. Axelrod, X. He, and J. Gao, "Domain adaptation via pseudo in-domain data selection," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11), pp. 355–362, July 2011.

[3] P. Koehn, F. J. Och, and D. Marcu, "Statistical phrase-based translation," in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL '03), vol. 1, pp. 48–54, 2003.

[4] P. Koehn, A. Axelrod, A. B. Mayne, C. Callison-Burch, M. Osborne, and D. Talbot, "Edinburgh system description for the 2005 IWSLT speech translation evaluation," in Proceedings of the International Workshop on Spoken Language Translation (IWSLT '05), pp. 68–75, 2005.

[5] P. Koehn, "Pharaoh: a beam search decoder for phrase-based statistical machine translation models," in Proceedings of the Association for Machine Translation in the Americas (AMTA '04), pp. 115–124, Springer, Berlin, Germany, 2004.

[6] Y. Lu, J. Huang, and Q. Liu, "Improving statistical machine translation performance by training data selection and optimization," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '07), pp. 343–350, June 2007.

[7] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), pp. 311–318, Philadelphia, Pa, USA, 2002.

[8] H. Daumé III and J. Jagarlamudi, "Domain adaptation for machine translation by mining unseen words," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT '11), pp. 407–412, June 2011.

[9] S. Mansour and H. Ney, "A simple and effective weighted phrase extraction for machine translation adaptation," in Proceedings of the 9th International Workshop on Spoken Language Translation (IWSLT '12), pp. 193–200, 2012.

[10] P. Koehn and B. Haddow, "Towards effective use of training data in statistical machine translation," in Proceedings of the 7th ACL Workshop on Statistical Machine Translation, pp. 317–321, 2012.

[11] J. Civera and A. Juan, "Domain adaptation in statistical machine translation with mixture modeling," in Proceedings of the 2nd ACL Workshop on Statistical Machine Translation, pp. 177–180, 2007.

[12] G. Foster and R. Kuhn, "Mixture-model adaptation for SMT," in Proceedings of the 2nd ACL Workshop on Statistical Machine Translation, pp. 128–136, 2007.

[13] V. Eidelman, J. Boyd-Graber, and P. Resnik, "Topic models for dynamic translation model adaptation," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers (ACL '12), vol. 2, pp. 115–119, 2012.

[14] S. Matsoukas, A. V. I. Rosti, and B. Zhang, "Discriminative corpus weight estimation for machine translation," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '09), vol. 2, pp. 708–717, August 2009.

[15] A. S. Hildebrand, M. Eck, S. Vogel, and A. Waibel, "Adaptation of the translation model for statistical machine translation based on information retrieval," in Proceedings of the 10th Annual Conference of the European Association for Machine Translation (EAMT '05), pp. 133–142, May 2005.

[16] S. C. Lin, C. L. Tsai, L. F. Chien, K. J. Chen, and L. S. Lee, "Chinese language model adaptation based on document classification and multiple domain-specific language models," in Proceedings of the 5th European Conference on Speech Communication and Technology, 1997.

[17] J. Gao, J. Goodman, M. Li, and K. F. Lee, "Toward a unified approach to statistical language modeling for Chinese," ACM Transactions on Asian Language Information Processing (TALIP), vol. 1, pp. 3–33, 2002.

[18] R. C. Moore and W. Lewis, "Intelligent selection of language model training data," in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL '10), pp. 220–224, July 2010.

[19] V. I. Levenshtein, "Binary codes capable of correcting deletions, insertions, and reversals," Soviet Physics Doklady, vol. 10, p. 707, 1966.

[20] P. Koehn and J. Senellart, "Convergence of translation memory and statistical machine translation," in Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pp. 21–31, 2010.

[21] J. Leveling, D. Ganguly, S. Dandapat, and G. J. F. Jones, "Approximate sentence retrieval for scalable and efficient example-based machine translation," in Proceedings of the 24th International Conference on Computational Linguistics (COLING '12), pp. 1571–1586, 2012.

[22] T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning, vol. 1, Springer, New York, NY, USA, 2001.

[23] H. Wu and H. Wang, "Pivot language approach for phrase-based statistical machine translation," Machine Translation, vol. 21, no. 3, pp. 165–181, 2007.

[24] P. Koehn and J. Schroeder, "Experiments in domain adaptation for statistical machine translation," in Proceedings of the 2nd ACL Workshop on Statistical Machine Translation, pp. 224–227, 2007.

[25] A. Stolcke, "SRILM: an extensible language modeling toolkit," in Proceedings of the International Conference on Spoken Language Processing, pp. 901–904, 2002.

[26] S. F. Chen and J. Goodman, "An empirical study of smoothing techniques for language modeling," in Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL '96), pp. 310–318, 1996.

[27] P. Koehn, "Europarl: a parallel corpus for statistical machine translation," in Proceedings of the 10th Machine Translation Summit (AAMT '05), vol. 5, pp. 79–86, Phuket, Thailand, 2005.

[28] H. P. Zhang, H. K. Yu, D. Y. Xiong, and Q. Liu, "HHMM-based Chinese lexical analyzer ICTCLAS," in Proceedings of the 2nd SIGHAN Workshop on Chinese Language Processing, vol. 17, pp. 184–187, 2003.

[29] P. Koehn, H. Hoang, A. Birch, et al., "Moses: open source toolkit for statistical machine translation," in Proceedings of the 45th Annual Meeting of the ACL, Interactive Poster and Demonstration Sessions (ACL '07), pp. 177–180, 2007.

[30] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, no. 1, pp. 19–51, 2003.

[31] F. J. Och, "Minimum error rate training in statistical machine translation," in Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL '03), vol. 1, pp. 160–167, 2003.

[32] M. Federico, N. Bertoldi, and M. Cettolo, "IRSTLM: an open source toolkit for handling large scale language models," in Proceedings of the 9th Annual Conference of the International Speech Communication Association (INTERSPEECH '08), pp. 1618–1621, September 2008.

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 2: Research Article A Systematic Comparison of Data Selection …downloads.hindawi.com/journals/tswj/2014/745485.pdf · 2019-07-31 · Research Article A Systematic Comparison of Data

2 The Scientific World Journal

where the sentences in source 119878119894 target 119879

119894 or both sides can

be considered for similarity measuring thus we uniformlydefine them as 119881

119894 And 119877 is an abstract model to represent

the target domain These various applications of (1) will bedetailed in Section 3

(2) Resampling Each scored sentence pair that is given a highweight value will be kept otherwise will be removed fromthe pseudo in-domain subcorpus according to a binary filterfunction

Filter (Score119894) =

1 Score119894gt 120579

0 Others(2)

in which Score119894is the similarity value of the 119894th sentence pair

⟨119878119894 119879119894⟩ according to (1) and 120579 is a tunable threshold It gives 1

for ⟨119878119894 119879119894⟩ with score higher than 120579 and zero otherwise

Note that the first two steps can also be applied to theselection of 119872 for the construction of the LM The onlydifference is that (1) and (2) can only consider a sentence 119871

119894

of119872 in monolingual environment

(3) TranslationAfter collecting a certain amount of ⟨119878119894 119879119894⟩ or

⟨119871119894⟩ TM and LM will be trained on these pseudo in-domain

subsetsA phrase-based SMT model shown in (3) can be

employed to obtain the best translation

119879best = argmax119879

Pr (119879 | 119878)

= argmax119879

119872

sum119898=1

120582119898ℎ119898 (119879 119878)

(3)

where the ℎ119898(119905 119904) represents a feature function and 120582

119898is

the weight assigned to the corresponding feature functionIn general the SMT system uses a total of eight features an119899-gram LM two phrase translation probabilities two lexicaltranslation probabilities a word penalty a phrase penalty anda linear reordering penalty [3ndash5]

As similarity measuring has shown a great impact ontranslation qualities our goal is to find what kind of dataselection model can better benefit the domain-specific trans-lation effectively and robustly Therefore we systematicallyinvestigated and compared three state-of-the-art data selec-tion criteriaThe first two are the cosine tf-idf and perplexity-based criteria which are techniques of information retrievaland language modeling for SMT The third one is the edit-distance-based method which is to be explored for the firsttime for this special task

The results show that each individual criterion has thefollowing natural pros and cons

(i) Cosine tf-idf regards text as a bag of words andtries to retrieve similar sentences according to wordoverlap Although it is helpful to reduce the out-of-vocabulary (OOV) words this simple cooccurrencebased matching will result in weakness of filteringirrelevant data (noise)

(ii) Perplexity-based similaritymetrics employ an 119899-gramLM which considers not only the distribution of

Perplexity-based

Edit distanceWord position

Word order

Word overlapQuery sentence

(V)

Candidate sentence

(R)

Cosine tf-idf

RetrieveSim(V R

)

Figure 1 Data selection criteria pyramid

terms but also the collocation (119899-gram word order)It works well in balancing the OOVs and noiseby considering more factors But its performance issensitive to the quality of an in-domain LM as well asthe quantity of pseudo in-domain subcorpus

(iii) Edit-distance-based method is much stricter thanthe former two criteria because the factors of wordsoverlap order and position are all comprehensivelyconsidered for similarity measuring This seems tobe able to find the most similar sentences but itfinally fails to outperform the general baseline in ourexperiments

The more factors are considered the higher constraintdegree of similarity criteria are This can be depicted by apyramid Figure 1 shows the comparative depths of differentcriteria edit distance at the peak followed by perplexity-based and the cosine tf-idf The method would cover all thefactors underneath For instance perplexity-based methodsconsider both word overlap and word order in similaritymeasuring With considering more factors the criterion athigher level is stricter than the ones below

Although most SMT systems optimized by the presentedmethods can outperform the baseline system trained ongeneral corpus (general baseline) their improvements are stilleither unclear or unstable (in Section 2) It is hard for theseindividual models to be applied in a real-life system thus acombination approach is proposed as a trade-offWe combineall the presented (cosine tf-idf perplexity-based and edit-distance-based criteria by performing linear interpolationat two levels (i) corpus level where we join the pseudo in-domain subcorpora selected by different similarity metricsat the resampling stage and (ii) model level where multiplemodels trained on pseudo in-domain subsets retrieved viadifferent similarity metrics are combined together at trans-lation stage Since TM adaptation and LM adaptation maybenefit each other [6] we combine these two adaptationapproaches

Comparative experiments were conducted using a largeChinese-English general corpus where SMTs are trained andadapted to translate the in-domain sentences of the Hong

The Scientific World Journal 3

Kong law statements We measure the influence of differ-ent data selection methods on the final translation outputUsing BLEU [7] as an evaluation metric results indicatethat our proposed approach successfully obtains a robustperformance which is still better than any single individualmodel as well as the baseline system Although the edit-distance-based technique does not outperform the others inthe individual setting the selected data by this techniquecan complement the data retrieved by the other techniquesas demonstrated in the combined scenario Although theedit-distance-based technique is the strictest one amongthe presented criteria it fails to outperform others for thedomain-specific translation task

If a small in-domain corpus is available it is goodto use this to select pseudo in-domain data from generalcorpus as well as improve the current translation sys-tems via combination methods Thus we further combinethe adapted system trained on pseudo in-domain data withthe one trained on a small in-domain corpus using multiplepaths decoding technique (as discussed in Section 2) Finallywe show that combining this small in-domainTMcan furtherimprove the best domain-adapted SMT system by up to 121BLEU points and is also better than the general + in-domainsystem by at most 331 points

The remainder of this paper is organized as follows. We firstly review the related work in Section 2. The proposed and other related similarity models are described in Section 3. Section 4 details the configurations of the experiments. Finally, we compare and discuss the results in Section 5, followed by the conclusions to end the paper.

2. Related Work

Researchers have discussed the domain adaptation problem for SMT from various perspectives, such as mining unknown words from comparable corpora [8], weighted phrase extraction [9], corpus weighting [10], and mixing multiple models [11–13]. Data selection is actually one of the corpus weighting methods (the others being data weighting and translation model adaptation) [14]. In this section we revisit the state-of-the-art selection criteria and our combination techniques.

2.1. Data Selection Criteria. The first selection criterion is the cosine tf-idf (term frequency-inverse document frequency) similarity, which comes from the realm of information retrieval (IR). Hildebrand et al. [15] applied this IR technique to select fewer but more similar sentences for the adaptation of TM and LM. They concluded that it is possible to adapt this method to improve the translation performance, especially in LM adaptation. Similar to the experiments described in this paper, Lu et al. [6] proposed resampling and reweighting methods for online and offline TM optimization, which are closer to a real-life SMT system. Furthermore, their results indicated that duplicated sentences can affect the translations: they obtained about a 1 BLEU point improvement using 60% of the total data. In this study we still consider duplicated sentences as a hidden factor for all the presented methods.

The second category is the perplexity-based approach, which can be found in the field of language modeling. It was adapted by Lin et al. [16] and Gao et al. [17], in which perplexity is used to score text segments according to an in-domain LM. More recently, Moore and Lewis [18] derived the cross-entropy difference metric from a simple variant of Bayes rule; however, that preliminary study did not yet show an improvement on an MT task. The method was further developed by Axelrod et al. [2] for SMT adaptation. They also presented a novel bilingual method and compared it with other variants. Their experimental results show that this fast and simple technique allows discarding over 99% of the general corpus, resulting in an increase of 1.8 BLEU points. However, they tested the various perplexity-based methods using only a limited set of threshold settings in (2), which may not reflect their overall performance.

In addition, previous work has often discussed the related methods separately for the TM [2] or the LM [15]. But Lu et al. [6] pointed out that combining LM and TM adaptation approaches could further improve performance. In this paper we optimize both TM and LM via data selection methods.

The third retrieval model is not explicitly used for SMT but is still applicable to our scenario. Edit distance (ED), also known as Levenshtein distance (LD) [19], is a widely used similarity measure for example-based MT (EBMT). Koehn and Senellart [20] applied this method for the convergence of translation memory (TM) and SMT. Leveling et al. [21] investigated different approximate sentence retrieval approaches (e.g., LD and standard IR) for EBMT. Both papers give the exact formula of fuzzy matching. This inspires us to regard the metric as a new data selection criterion for the SMT domain adaptation task. Good performance could be expected under the assumption that the general corpus is big enough to cover very similar sentences with respect to the test data.

2.2. Combination Methods. The existing domain adaptation methods can be summarized into two broad categories: (i) corpus level, by selecting, joining, or weighting the datasets upon which the models are trained, and (ii) model level, by combining multiple models together in a weighted manner [2]. In corpus level combination, sentence frequency is often used as a weighting scheme. Hildebrand et al. [15] allowed duplicated sentences in the selected dataset, which is a hidden bias toward in-domain data. Furthermore, Lu et al. [6] gave each relevant sentence a higher integer weight in the GIZA++ file. Mixture modeling is a standard technique in machine learning [22]. Foster and Kuhn [12] interpolated multiple models together by applying linear and log-linear weights to the entries of the phrase tables. Interpolation is also often used in pivot-based SMT to combine standard and pivot models [23]. In order to investigate the best combination of these data selection criteria, we compare different combination methods at both the corpus level and the model level.

Furthermore, additional in-domain corpora could also be used to enhance current TMs for domain adaptation. Linear interpolation (similar to combination at the model level) could be applied to combine the in-domain TM with the domain-adapted TM. However, directly concatenating the phrase tables into one may lead to unpredictable behavior during decoding. In addition, Axelrod et al. [2] concluded that it is more effective to use the multidecoding method [24] than linear interpolation. Thus we prefer to employ multiple phrase tables in parallel at decoding time for this task.

3. Model Description

This section describes four data selection models: those based on cosine tf-idf, perplexity, and edit distance, respectively, as well as the proposed combination approach.

3.1. Cosine tf-idf. Each document D_i is represented as a vector (w_i1, w_i2, ..., w_in), where n is the size of the vocabulary. The weight w_ij is calculated as follows:

w_ij = tf_ij × log(idf_j),    (4)

in which tf_ij is the term frequency (TF) of the jth word of the vocabulary in document D_i, and idf_j is the inverse document frequency (IDF) of the jth word. The similarity between two texts is then defined as the cosine of the angle between their vectors. We implemented this with Apache Lucene (available at http://lucene.apache.org/), and it is very similar to the experiments described in [6, 15]. Suppose that M is the size of the query set and N is the number of sentences retrieved from the general corpus for each query. Thus the size of the cosine tf-idf based pseudo in-domain subcorpus is Size_Cos-IR = M × N.

3.2. Perplexity. Perplexity is based on the cross-entropy (5), which is the average of the negative logarithm of the word probabilities. Consider

H(p, q) = -Σ_{i=1..n} p(w_i) log q(w_i) = -(1/N) Σ_{i=1..n} log q(w_i),    (5)

where p denotes the empirical distribution of the test sample, with p(w_i) = n/N if w_i appeared n times in the test sample of size N, and q(w_i) is the probability of event w_i estimated from the training set. Thus the perplexity pp can be simply transformed as

pp = b^{H(p, q)},    (6)

where b is the base with respect to which the cross-entropy is measured (e.g., bits or nats), and H(p, q) is the cross-entropy given in (5), which is often applied as a substitute for perplexity in data selection [2, 18].

Let H_I(p, q) and H_G(p, q) be the cross-entropy of a string according to language models LM_I and LM_G, which are respectively trained on the in-domain dataset I and the general-domain dataset G. Considering the source (src) and target (tgt) sides of the training data, there are three perplexity-based variants. The first is called basic cross-entropy, given by

H_{I-src}(p, q).    (7)

The second one is the Moore-Lewis cross-entropy difference [18]:

H_{I-src}(p, q) - H_{G-src}(p, q),    (8)

which tries to select the sentences that are more similar to I but different from the others in G. The above two criteria only consider the sentences in the source language. Furthermore, Axelrod et al. [2] proposed a metric that sums the cross-entropy difference over both sides:

[H_{I-src}(p, q) - H_{G-src}(p, q)] + [H_{I-tgt}(p, q) - H_{G-tgt}(p, q)].    (9)

The candidates with lower scores (obtained by (7), (8), and (9)) have higher relevance to the target domain. The size of the perplexity-based pseudo in-domain subset, Size_PP, should be equal to Size_Cos-IR. In practice we use the SRILM toolkit (available at http://www.speech.sri.com/projects/srilm/) [25] to build 5-gram LMs with interpolated modified Kneser-Ney discounting [26].
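As an illustration of the three scores (7)-(9), here is a small Python sketch; the toy unigram model only stands in for the SRILM-built 5-gram LMs, and the class and function names are ours, not part of any toolkit:

```python
import math
from collections import Counter

class UnigramLM:
    """Toy add-one-smoothed unigram LM; a stand-in for the 5-gram LMs
    trained with SRILM, exposing only a per-word cross-entropy."""
    def __init__(self, corpus):
        self.counts = Counter(w for sent in corpus for w in sent)
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1
    def cross_entropy(self, sent):
        # H(p, q) = -(1/N) * sum_i log q(w_i), as in (5)
        logprob = sum(math.log((self.counts[w] + 1) / (self.total + self.vocab)) for w in sent)
        return -logprob / max(len(sent), 1)

def basic_ce(lm_i_src, src):                                          # criterion (7)
    return lm_i_src.cross_entropy(src)

def moore_lewis(lm_i_src, lm_g_src, src):                             # criterion (8)
    return lm_i_src.cross_entropy(src) - lm_g_src.cross_entropy(src)

def bilingual_ced(lm_i_src, lm_g_src, lm_i_tgt, lm_g_tgt, src, tgt):  # criterion (9)
    return (moore_lewis(lm_i_src, lm_g_src, src) +
            moore_lewis(lm_i_tgt, lm_g_tgt, tgt))

# Sentence pairs are ranked by ascending score; the N lowest-scoring
# pairs form the perplexity-based pseudo in-domain subset.
```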

3.3. Edit Distance. Given a sentence s_G from G and a sentence s_R from the test set or in-domain corpus, the edit distance between these two sequences is defined as the minimum number of edits, that is, symbol insertions, deletions, and substitutions, needed to transform s_G into s_R. Based on Levenshtein distance (edit distance), there are several different implementations. We used the normalized Levenshtein similarity score (fuzzy matching score, FMS):

FMS = 1 - LD(s_G, s_R) / max(|s_G|, |s_R|),    (10)

where |s_G| and |s_R| are the lengths of s_G and s_R [20, 21]. In this study we employed the word-based Levenshtein edit distance function. If the score of a sentence exceeds a threshold, we further penalize it according to whitespace and punctuation editing differences. The algorithm is implemented in MapReduce to parallelize the process and shorten the processing time.
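A compact Python sketch of the word-based fuzzy matching score in (10) follows (the whitespace/punctuation penalty and the MapReduce parallelization are omitted; function names are illustrative):

```python
def levenshtein(a, b):
    """Word-based Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (wa != wb)))   # substitution
        prev = cur
    return prev[-1]

def fms(s_g, s_r):
    """Fuzzy matching score (10): 1 - LD(s_G, s_R) / max(|s_G|, |s_R|)."""
    return 1.0 - levenshtein(s_g, s_r) / max(len(s_g), len(s_r))

# Example: fms("the contract shall apply".split(), "the contract applies".split())
```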

3.4. Proposed Model. For corpus level combination, we respectively run the GIZA++ toolkit (available at http://www.statmt.org/moses/giza/GIZA++.html) on the pseudo in-domain subcorpora selected by the different data selection methods. Then each sentence in the different subcorpora can be weighted by modifying its corresponding number of occurrences in the GIZA++ file [6]. Finally, we combine these GIZA++ files together into a new one. Formally, this combination method can be defined as follows:

α·CosIR(S_x, T_x) ∪ β·PPBased(S_y, T_y) ∪ λ·EDBased(S_z, T_z),    (11)


where α, β, and λ are the weights for the sentence pairs (S_x, T_x), (S_y, T_y), and (S_z, T_z), which are selected by the cosine tf-idf (Cos-IR), perplexity-based (PP-Based), and edit-distance-based (ED-Based) methods, respectively. Note that the number of occurrences of a sentence should be an integer. In practice we made α, β, and λ integers and multiplied these weights with the occurrences of each corresponding sentence in the GIZA++ file.
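A minimal sketch of the corpus level combination in (11): each selected sentence pair is repeated according to its integer weight before word alignment, mimicking the modification of occurrence counts in the GIZA++ file (the GIZA++ invocation itself is not shown; names are ours):

```python
from collections import Counter

def corpus_level_combine(cos_ir_pairs, pp_pairs, ed_pairs, alpha=1, beta=1, lam=1):
    """Merge the three pseudo in-domain subcorpora as in (11), weighting each
    (source, target) pair by an integer number of occurrences."""
    merged = Counter()
    for weight, pairs in ((alpha, cos_ir_pairs), (beta, pp_pairs), (lam, ed_pairs)):
        for src, tgt in pairs:
            merged[(src, tgt)] += weight
    # Expand back into a (possibly repeated) list of sentence pairs for alignment.
    return [pair for pair, count in merged.items() for _ in range(count)]
```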

For model level combination, we perform linear interpolation on the models trained with the subcorpora retrieved by the different data selection methods. The phrase translation probability ϕ(t | s) and the lexical weight p_w(t | s, a) are estimated using (12) and (13), respectively, as follows:

ϕ(t | s) = Σ_i α_i ϕ_i(t | s),    (12)

p_w(t | s, a) = Σ_i β_i p_{w,i}(t | s, a),    (13)

where i = 1, 2, and 3 indexes the phrase translation probabilities and lexical weights trained on the subcorpora retrieved by the Cos-IR, PP-Based, and ED-Based models, s and t are the phrases in the source and target language, a is the alignment information, and α_i and β_i are the tunable interpolation parameters, subject to Σ α_i = Σ β_i = 1.

In this paper we are more interested in the comparison of various data selection criteria rather than the best combination method; thus we did not tune these weights in the above equations and gave them equal weights (at corpus level, α = β = λ = 1; at model level, α_i = β_i = 1/3) in the experiments. Moreover, we will see that even these simple combinations can lead to an ideal improvement of translation quality.
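The model level combination of (12) and (13) can be sketched as a linear interpolation over phrase-table entries; the dictionary-based phrase-table representation and the equal 1/3 weights below are illustrative assumptions, not the Moses file format:

```python
def interpolate_phrase_tables(tables, weights):
    """phi(t|s) = sum_i alpha_i * phi_i(t|s), and likewise for the lexical
    weights, as in (12) and (13). Each table maps a (source phrase, target
    phrase) pair to a list of feature scores; a pair missing from a table
    contributes zero for that component."""
    assert abs(sum(weights) - 1.0) < 1e-9
    combined = {}
    for table, w in zip(tables, weights):
        for pair, scores in table.items():
            acc = combined.setdefault(pair, [0.0] * len(scores))
            for k, s in enumerate(scores):
                acc[k] += w * s
    return combined

# Equal weights, as in the experiments (alpha_i = beta_i = 1/3):
# merged = interpolate_phrase_tables([t_cos_ir, t_pp, t_ed], [1/3, 1/3, 1/3])
```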

4. Experimental Setup

4.1. Corpora. Two corpora are needed for the domain adaptation task. The general corpus includes more than 1 million parallel sentences, comprising a variety of genres such as newswires (LDC2005T10), translation examples from dictionaries, law statements, and sentences from online sources. The domain distribution of the general corpus is shown in Table 1. The miscellaneous part includes sentences crawled from various materials, and the law portion includes articles collected from Chinese mainland, Hong Kong, and Macau, which have different laws. The in-domain corpus, development set, and test set are randomly selected (and disjoint) from the Hong Kong law corpus (LDC2004T08). The sizes of the test set, in-domain corpus, and general corpus we used are summarized in Table 2.

Table 1: Proportions of domains of the general corpus.

Domain          Sent. number    %
News            279,962         24.60
Novel           304,932         26.79
Law             48,754          4.28
Miscellaneous   504,396         44.33
Total           1,138,044       100.00

In the preprocessing, the English texts are tokenized by the Moses scripts (available at http://www.statmt.org/europarl/) [27] and the Chinese texts are segmented by NLPIR (available at http://ictclas.nlpir.org/) [28]. We removed the sentences whose length is more than 80 tokens. Furthermore, we only used the target side of the general corpus as the general-domain LM training corpus. Thus the target side of the pseudo in-domain corpus obtained by the previous methods (in Section 3) could be directly used for training the adapted LM.
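A small sketch of the length filter described above, assuming already tokenized (EN) and segmented (ZH) parallel text; the 80-token cutoff follows the paper, while applying it to both sides and the file-free interface are illustrative assumptions:

```python
def filter_by_length(src_sents, tgt_sents, max_len=80):
    """Keep only parallel sentence pairs in which neither side exceeds max_len tokens."""
    return [(s, t) for s, t in zip(src_sents, tgt_sents)
            if len(s.split()) <= max_len and len(t.split()) <= max_len]
```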

4.2. Systems Description. The experiments presented in this paper were carried out with the Moses toolkit [29], a state-of-the-art open-source phrase-based SMT system. The translation and the reordering models relied on "grow-diag-final" symmetrized word-to-word alignments built using GIZA++ [30] and the training script of Moses. The weights of the log-linear model were optimized by means of MERT [31].

A 5-gram language model was trained using the IRSTLM toolkit [32], exploiting improved modified Kneser-Ney smoothing and quantizing both probabilities and back-off weights.

4.3. Settings. For the comparison, five existing representative data selection models, three baseline systems, and the proposed model were selected in total. The corresponding settings of the above models are as follows.

(i) Baseline: the baseline systems were trained with the toolkits and settings described in Section 4.2. The in-domain baseline (IC-baseline) and the general-domain baseline (GC-baseline) were respectively trained on the in-domain corpus and the general corpus. Then a combined baseline system (GI-baseline) was created by passing the above two phrase tables to the decoder and using them in parallel.

(ii) Individual model: as described in Sections 3.1, 3.2, and 3.3, the individual models are cosine tf-idf (Cos-IR) and the fuzzy matching scorer (FMS), which is an edit-distance-based (ED-Based) instance, as well as three perplexity-based (PP-Based) variants: cross-entropy (CE), cross-entropy difference (CED), and bilingual cross-entropy difference (B-CED).

(iii) Proposed model: as described in Section 3.4, we combined the Cos-IR, PP-Based, and ED-Based methods at the corpus level (named iTPB-C) and the model level (named iTPB-M).

The test set, development set [6, 15], and in-domain corpus [2, 18] can be used to select data from the general-domain corpus. The first strategy is under the assumption that the input data is known before building the models. But in a real case we do not know the test data in advance. Therefore, we use the development set (dev strategy) and an additional in-domain corpus (in-domain strategy), which are identical to the target domain, to select data for TM and LM adaptation.


Table 2: Corpora statistics.

Data set       Lang    Sentences    Tokens        Av. len.
Test set       EN      2,050        60,399        29.46
               ZH      2,050        59,628        29.09
Dev set        EN      2,000        59,732        29.26
               ZH      2,000        59,064        29.07
In-domain      EN      43,621       1,330,464     29.16
               ZH      43,621       1,321,655     28.97
Training set   EN      1,138,044    28,626,367    25.15
               ZH      1,138,044    28,239,747    24.81

5. Results and Discussions

For each method we used the N-best selection results, where N = 20K, 40K, 80K, 160K, 320K, and 640K sentence pairs out of the 1.1M in the general corpus (roughly 1.75%, 3.5%, 7.0%, 14.0%, 28.0%, and 56% of the general-domain corpus, where K is short for thousand and M is short for million). In this section we firstly discuss the three PP-Based variants and then compare the best one with the other two models, Cos-IR and FMS. After that, the combined model and the best individual model are compared. Finally, the performance is further improved by a simple but effective system combination method.

5.1. Baselines. The baseline results (in Table 3) show that a translation system trained on the general corpus outperforms the one trained on the in-domain corpus by over 2.85 BLEU points. The main reason is that the general corpus is so broad that there are far fewer OOVs. Combining these two systems further increases the GC-baseline by 1.91 points.

5.2. Individual Model. The curves in Figure 2 show that all three PP-Based variants, that is, CE, CED, and B-CED, can be used to train domain-adapted SMT systems. Using less training data, they still obtain better performance than the GC-baseline. CE is the simplest PP-Based variant and improves the GC-baseline by at most 1.64 points when selecting 320K out of 1.1M sentences. Although CED gives a bias to the sentences that are more similar to the target domain, its performance seems unstable with the dev strategy and is the worst with the in-domain strategy. This indicates that it is not suitable to randomly select data from the general corpus for training LM_G (see Section 3.2). Similar to the conclusion in [2], B-CED achieves the highest improvement with the least selected data among the three variants. This proves that bilingual resources are helpful to balance OOVs and noise. However, when using the in-domain corpus to select pseudo in-domain subcorpora, their trends are similar but worse: they have to enlarge the size of the selected data to perform nearly as well as the results under the dev strategy. Therefore, PP-Based approaches may not work well for a real-life SMT application.

Next we use B-CED (the best among the PP-Based variants) to compare with the other selection criteria. As shown in Figure 3, Cos-IR improves by at most 1.02 (dev) and 0.88 (in-domain) BLEU points using 28% of the data of the general corpus. The results approximately match the conclusions given

Table 3: BLEU via general-domain and in-domain corpus.

Baseline       Sentences    BLEU
GC-baseline    1.1M         39.15
IC-baseline    45K          36.30
GI-baseline                 41.06

Table 4: Translation results of iTPB at different combination levels.

Methods    Sent.    BLEU (dev set)    BLEU (in-domain set)
B-CED      80K      40.91             35.50
           160K     41.12             39.47
           320K     40.02             40.98
iTPB-C     80K      42.25             39.39
           160K     43.04             41.87
           320K     42.42             40.44
iTPB-M     80K      42.93             40.57
           160K     43.65             41.95
           320K     43.97             42.21

by [6, 15], showing that keyword overlap (see Figure 1) plays a significant role in retrieving sentences from similar domains. However, it still needs a large amount of selected data (more than 28%) to obtain an ideal performance. The main reason may be that sentences containing the same keywords can still be irrelevant. For example, two sentences may share the same phrase "according to the article," but one sentence is in the legal domain and the other is from news. Figure 3 also shows that FMS fails to outperform the GC-baseline system even though it is much stricter than the other criteria. By adding the word position factor into similarity measuring, FMS tries to find highly similar sentences in terms of length, collocation, and even semantics. But it seems that our general corpus is not large enough to cover a sufficient number of FMS-similar sentences. As the size of the general or in-domain corpus increases, FMS may benefit the translation quality, because it still works better than the IC-baseline, which proves its positive impact on filtering noise.

Among the three presented criteria, PP-Based can achieve the highest BLEU by considering an appropriate number of factors for similarity measuring. However, the curves show that it depends heavily upon the threshold θ in (2).


Figure 2: BLEU scores via perplexity-based data selection methods with dev (a) and in-domain (b) strategies. (Curves: CE, CED, B-CED, and GC-base; x-axis: the number of selected sentences in K; y-axis: BLEU.)

Figure 3: BLEU scores via different data selection methods with dev (a) and in-domain (b) strategies. (Curves: Cos-IR, B-CED, FMS, and GC-base; x-axis: the number of selected sentences in K; y-axis: BLEU.)

Selecting more or less pseudo in-domain data will lead to the performance dropping sharply. Instead, Cos-IR works steadily and robustly with either strategy, but its improvements are not clear. Thus no single individual model performs well in terms of both effectiveness and robustness.

5.3. Combined Model. From Figure 3 we found that each individual model peaks between 80K and 320K. Thus we only selected the top N = 80K, 160K, and 320K sentences for further comparison. We combined Cos-IR and FMS as well as B-CED and assigned equal weights to each individual model at both the corpus and model levels (as described in Section 3.4). The translation qualities via iTPB are shown in Table 4.

At both levels, iTPB performs much better than any single individual model as well as the GC-baseline system. For instance, iTPB-C achieves at most 3.89 (dev) and 2.72 (in-domain) BLEU point improvements over the baseline system. The result is also higher than the best individual model (B-CED) by 1.92 (dev) and 0.91 (in-domain) points. This shows a strong ability to balance OOVs and noise: on the one hand, filtering out too many unmatched words may not sufficiently address the data sparsity issue of the SMT model; on the other hand, adding too much of the selected data may lead to the dilution of the in-domain characteristics of the SMT model. The combinations, however, seem to inherit the pros and reduce the cons of the individual models. In addition, the performance of iTPB does not drop sharply when changing the threshold in (2) or the selection strategy.


Table 5: Results of combination models.

Methods        Sent.    BLEU (dev set)    BLEU (in-domain set)
GI-baseline             41.06
IC-baseline    45K      36.30
B-CED+I        80K      41.05             35.80
               160K     41.14             39.54
               320K     41.67             40.67
iTPB-C+I       80K      42.85             41.45
               160K     43.48             41.88
               320K     43.63             41.98
iTPB-M+I       80K      43.33             41.83
               160K     44.13             42.43
               320K     44.37             42.53

This shows that the combined models are both stable and robust enough to train a state-of-the-art SMT system. We also show that model level combination works better (up to +1 BLEU) than corpus level combination with the same settings.

5.4. System Combination. In order to advantageously use all available data, we also combine the in-domain TM with the pseudo in-domain TM to further improve the translation quality. We used the multiple paths decoding technique as described in Section 2.2.

Table 5 shows that the translation systems trained on a pseudo in-domain subset selected with individual or combined models can be further improved by combining them with an in-domain TM. The iTPB-M+I system, comprising two phrase tables trained on the iTPB-M based pseudo in-domain subset and the in-domain corpus, still works best. This combined system is at most 3.31 (dev) and 1.47 (in-domain) points better than the GI-baseline system, and 8.07 (dev) and 6.23 (in-domain) points better than the in-domain system alone.

6. Conclusions

In this paper we analyze the impacts of different data selection criteria on SMT domain adaptation. This is the first work to systematically compare state-of-the-art data selection methods such as cosine tf-idf and perplexity. Among the revisited methods, we consider edit distance as a new similarity metric for this task. The in-depth analysis can be very valuable to other researchers working on data selection.

Based on this investigation, a combination approach that makes use of those individual models is proposed and extensively evaluated. The combination is conducted not only at the corpus level but also at the model level, where domain-adapted systems trained via different methods are interpolated to facilitate the translation. Five individual methods (Cos-IR, CE, CED, B-CED, and FMS), three baseline systems (GC-baseline, IC-baseline, and GI-baseline), as well as the proposed method are evaluated on a large general corpus. Finally, we further combined the best domain-adapted system with the TM trained on the in-domain corpus to maximize the translation quality. Empirical results reveal that the proposed model achieves a good performance in terms of robustness and effectiveness.

We analyze the results from three different aspects:

(i) Translation quality: the results show a significant performance for most methods, especially for iTPB. Under the current size of datasets, considering more factors in similarity measuring may not benefit the translation quality.

(ii) Noise and OOVs: it is a big challenge to balance them for a single individual data selection model. However, bilingual resources and combination methods are helpful to deal with this problem.

(iii) Robustness and effectiveness: a real-life system should achieve a robust and effective performance with the in-domain strategy. Only iTPB obtained a consistently boosted performance.

Finally, we can draw a composite conclusion (a > b means that a is better than b):

iTPB > PP-Based > Cos-IR > Baseline > FMS.    (14)

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors are grateful to the Science and Technology Development Fund of Macau and the Research Committee of the University of Macau for the funding support for their research under the Reference nos. MYRG076 (Y1-L2)-FST13-WF and MYRG070 (Y1-L2)-FST12-CS.


References

[1] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer, "The mathematics of statistical machine translation: parameter estimation," Computational Linguistics, vol. 19, no. 2, pp. 263–311, 1993.
[2] A. Axelrod, X. He, and J. Gao, "Domain adaptation via pseudo in-domain data selection," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11), pp. 355–362, July 2011.
[3] P. Koehn, F. J. Och, and D. Marcu, "Statistical phrase-based translation," in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL '03), vol. 1, pp. 48–54, 2003.
[4] P. Koehn, A. Axelrod, A. B. Mayne, C. Callison-Burch, M. Osborne, and D. Talbot, "Edinburgh system description for the 2005 IWSLT speech translation evaluation," in Proceedings of the International Workshop on Spoken Language Translation (IWSLT '05), pp. 68–75, 2005.
[5] P. Koehn, "Pharaoh: a beam search decoder for phrase-based statistical machine translation models," in Proceedings of the Antenna Measurement Techniques Association (AMTA '04), pp. 115–124, Springer, Berlin, Germany, 2004.
[6] Y. Lu, J. Huang, and Q. Liu, "Improving statistical machine translation performance by training data selection and optimization," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '07), pp. 343–350, June 2007.
[7] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the Workshop on Automatic Summarization (ACL '02), pp. 311–318, Philadelphia, Pa, USA, 2002.
[8] H. Daume III and J. Jagarlamudi, "Domain adaptation for machine translation by mining unseen words," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT '11), pp. 407–412, June 2011.
[9] S. Mansour and H. Ney, "A simple and effective weighted phrase extraction for machine translation adaptation," in Proceedings of the 9th International Workshop on Spoken Language Translation (IWSLT '12), pp. 193–200, 2012.
[10] P. Koehn and B. Haddow, "Towards effective use of training data in statistical machine translation," in Proceedings of the 7th ACL Workshop on Statistical Machine Translation, pp. 317–321, 2012.
[11] J. Civera and A. Juan, "Domain adaptation in statistical machine translation with mixture modeling," in Proceedings of the 2nd ACL Workshop on Statistical Machine Translation, pp. 177–180, 2007.
[12] G. Foster and R. Kuhn, "Mixture-model adaptation for SMT," in Proceedings of the 2nd ACL Workshop on Statistical Machine Translation, pp. 128–136, 2007.
[13] V. Eidelman, J. Boyd-Graber, and P. Resnik, "Topic models for dynamic translation model adaptation," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers (ACL '12), vol. 2, pp. 115–119, 2012.
[14] S. Matsoukas, A. V. I. Rosti, and B. Zhang, "Discriminative corpus weight estimation for machine translation," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '09), vol. 2, pp. 708–717, August 2009.
[15] A. S. Hildebrand, M. Eck, S. Vogel, and A. Waibel, "Adaptation of the translation model for statistical machine translation based on information retrieval," in Proceedings of the 10th Annual Conference of the European Association for Machine Translation (EAMT '05), pp. 133–142, May 2005.
[16] S. C. Lin, C. L. Tsai, L. F. Chien, K. J. Chen, and L. S. Lee, "Chinese language model adaptation based on document classification and multiple domain-specific language models," in Proceedings of the 5th European Conference on Speech Communication and Technology, 1997.
[17] J. Gao, J. Goodman, M. Li, and K. F. Lee, "Toward a unified approach to statistical language modeling for Chinese," ACM Transactions on Asian Language Information Processing (TALIP), vol. 1, pp. 3–33, 2002.
[18] R. C. Moore and W. Lewis, "Intelligent selection of language model training data," in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL '10), pp. 220–224, July 2010.
[19] V. I. Levenshtein, "Binary codes capable of correcting deletions, insertions and reversals," Soviet Physics Doklady, vol. 10, p. 707, 1966.
[20] P. Koehn and J. Senellart, "Convergence of translation memory and statistical machine translation," in Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pp. 21–31, 2010.
[21] J. Leveling, D. Ganguly, S. Dandapat, and G. J. F. Jones, "Approximate sentence retrieval for scalable and efficient example-based machine translation," in Proceedings of the 24th International Conference on Computational Linguistics (COLING '12), pp. 1571–1586, 2012.
[22] T. Hastie, R. Tibshirani, and J. J. H. Friedman, The Elements of Statistical Learning, vol. 1, Springer, New York, NY, USA, 2001.
[23] H. Wu and H. Wang, "Pivot language approach for phrase-based statistical machine translation," Machine Translation, vol. 21, no. 3, pp. 165–181, 2007.
[24] P. Koehn and J. Schroeder, "Experiments in domain adaptation for statistical machine translation," in Proceedings of the 2nd ACL Workshop on Statistical Machine Translation, pp. 224–227, 2007.
[25] A. Stolcke, "SRILM – an extensible language modeling toolkit," in Proceedings of the International Conference on Spoken Language Processing, pp. 901–904, 2002.
[26] S. F. Chen and J. Goodman, "An empirical study of smoothing techniques for language modeling," in Proceedings of the 34th Annual Meeting on Association for Computational Linguistics (ACL '96), pp. 310–318, 1996.
[27] P. Koehn, "Europarl: a parallel corpus for statistical machine translation," in Proceedings of the 10th Conference of Machine Translation Summit (AAMT '05), vol. 5, pp. 79–86, Phuket, Thailand, 2005.
[28] H. P. Zhang, H. K. Yu, D. Y. Xiong, and Q. Liu, "HHMM-based Chinese lexical analyzer ICTCLAS," in Proceedings of the 2nd SIGHAN Workshop on Chinese Language Processing, vol. 17, pp. 184–187, 2003.
[29] P. Koehn, H. Hoang, A. Birch, et al., "Moses: open source toolkit for statistical machine translation," in Proceedings of the 45th Annual Meeting of the ACL, Interactive Poster and Demonstration Sessions (ACL '07), pp. 177–180, 2007.
[30] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, no. 1, pp. 19–51, 2003.
[31] F. J. Och, "Minimum error rate training in statistical machine translation," in Proceedings of the 41st Annual Meeting on Association for Computational Linguistics (ACL '03), vol. 1, pp. 160–167, 2003.
[32] M. Federico, N. Bertoldi, and M. Cettolo, "IRSTLM: an open source toolkit for handling large scale language models," in Proceedings of the 9th Annual Conference of the International Speech Communication Association (INTERSPEECH '08), pp. 1618–1621, September 2008.

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 3: Research Article A Systematic Comparison of Data Selection …downloads.hindawi.com/journals/tswj/2014/745485.pdf · 2019-07-31 · Research Article A Systematic Comparison of Data

The Scientific World Journal 3

Kong law statements We measure the influence of differ-ent data selection methods on the final translation outputUsing BLEU [7] as an evaluation metric results indicatethat our proposed approach successfully obtains a robustperformance which is still better than any single individualmodel as well as the baseline system Although the edit-distance-based technique does not outperform the others inthe individual setting the selected data by this techniquecan complement the data retrieved by the other techniquesas demonstrated in the combined scenario Although theedit-distance-based technique is the strictest one amongthe presented criteria it fails to outperform others for thedomain-specific translation task

If a small in-domain corpus is available it is goodto use this to select pseudo in-domain data from generalcorpus as well as improve the current translation sys-tems via combination methods Thus we further combinethe adapted system trained on pseudo in-domain data withthe one trained on a small in-domain corpus using multiplepaths decoding technique (as discussed in Section 2) Finallywe show that combining this small in-domainTMcan furtherimprove the best domain-adapted SMT system by up to 121BLEU points and is also better than the general + in-domainsystem by at most 331 points

The remainder of this paper is organized as follows Wefirstly review the related work in Section 2The proposed andother related similarity models are described in Section 3Section 4 details the configurations of experiments Finallywe compare and discuss the results in Section 5 followed bythe conclusions to end the paper

2 Related Work

Researchers discussed the domain adaptation problems forSMT in various perspectives such as mining unknown wordsfrom comparable corpora [8] weighted phrase extraction[9] corpus weighting [10] and mixing multiple models [11ndash13] Actually data selection is one of the corpus weightingmethods (there are data selection data weighting and trans-lation model adaptation) [14] In this section we will revisitthe state-of-the-art selection criteria and our combinationtechniques

21 Data Selection Criteria The first selection criterion is thecosine tf-idf (term frequency-inverse document frequency)similarity which comes from the realm of informationretrieval (IR) Hildebrand et al [15] applied this IR techniqueto select less but more similar sentences to the adaptation ofTM and LM They concluded that it is possible to adapt thismethod to improve the translation performance especially inthe LM adaptation Similar to the experiments described inthis paper Lu et al [6] proposed resampling and reweightingmethods for online and offline TM optimization which arecloser to a real-life SMT system Furthermore their resultsindicated that duplicated sentences can affect the translationsThey obtained about 1 BLEU point improvement using 60of total data In this study we still consider using duplicatedsentences as a hidden factor for all the presented methods

The second category is perplexity-based approach whichcan be found in the field of language modeling This hasbeen adapted by Lin et al [16] and Gao et al [17] in whichperplexity is used to score text segments according to an in-domain LM More recently Moore and Lewis [18] derivedthe cross-entropy difference metric from a simple variantof Bayes rule However this is a preliminary study that didnot yet show an improvement for MT task The method wasfurther developed by Axelrod et al [2] for SMT adaptationThey also presented a novel bilingual method and comparedit with other variants The experimental results show that thefast and simple technique allows to discard over 99 of thegeneral corpus resulted in an increase of 18 BLEU pointsHowever they tested various perplexity-based methods onlyusing limited threshold in (2) settings which may not reflecttheir overall performances

In addition the previous work often separately discussedrelated methods on TM [2] or LM [15] But Lu et al[6] pointed out that combining LM and TM adaptationapproaches could further improve their performance In thispaper we optimized both TM and LM via data selectionmethods

The third retrieval model is not explicitly used for SMTbut is still applicable to our scenario Edit distance (ED)is a widely used similarity measure for example-based MT(EBMT) known as Levenshtein distance (LD) [19] Koehnand Senellart [20] applied this method for convergence oftranslationmemory (TM) and SMT Leveling et al [21] inves-tigated different approximated sentence retrieval approaches(eg LD and standard IR) for EBMT Both papers gave theexact formula of fuzzy matching This inspires us to regardthe metric as a new data selection criterion for SMT domainadaptation task Good performance could be expected underthe assumption that the general corpus is big enough to coverthe very similar sentences with respect to the test data

22 Combination Methods The existing domain adaptationmethods can be summarized into two broad categories (i)corpus level by selecting joining or weighting the datasetsupon which the models are trained and (ii) model level bycombining multiple models together in a weighted manner[2] In corpus level combination the sentence frequency isoften used as a weighting scheme Hildebrand et al [15]allowed duplicated sentences in selected dataset which is ahidden bias to in-domain data Furthermore Lu et al [6] gaveeach relevant sentence a higher integer weight in GIZA++file Mixture modeling approach is a standard technique inmachine learning [22] Foster and Kuhn [12] interpolatedthe multiple models together by performing linear and log-linear weights on the entries of phrase tables Interpolationmethod is also often used in pivot-based SMT to combinestandard and pivot models [23] In order to investigate thebest combination on these data selection criteria we comparedifferent combination methods at both corpus level andmodel level

Furthermore the additional in-domain corpora couldalso be used to enhance current TMs for domain adaptationLinear interpolation (similar to combination at model level)

4 The Scientific World Journal

could be applied to combine the in-domain TM with thedomain-adapted TM However directly concatenating thephrase tables into one may lead to unpredictable behaviorduring decoding In addition Axelrod et al [2] have con-cluded that it is more effective to use multidecoding method[24] than linear interpolation Thus we prefer to employ theconfusion ofmultiple phrase tables at decodingmodel for thistask

3 Model Description

This section describes four data selection criteria respec-tively based on cosine tf-idf perplexity and edit distance aswell as the proposed approach

31 Cosine tf-idf Each document119863119894is represented as a vector

(1199081198941 1199081198942 119908

119894119899) and 119899 is the size of the vocabulary So119908

119894119895is

calculated as follows

119908119894119895= tf119894119895times log (idf

119895) (4)

in which tf119894119895is the term frequency (TF) of the 119895th word in

the vocabulary in the document 119863119894and idf

119895is the inverse

document frequency (IDF) of the 119895th word calculated Thesimilarity between two texts is then defined as the cosine ofthe angle between two vectors We implemented this withApache Lucene (available at httpluceneapacheorg) andit is very similar to the experiments described in [6 15]Suppose that119872 is the size of query set and119873 is the numberof sentences retrieved from general corpus according to eachquery Thus the size of the cosine tf-idf based pseudo in-domain subcorpus is SizeCos-IR = 119872 times119873

32 Perplexity Perplexity is based on the cross-entropy (5)which is the average of the negative logarithm of the wordprobabilities Consider

119867(119901 119902) = minus

119899

sum

119894=1

119901 (119908119894) log 119902 (119908

119894)

= minus1

119873

119899

sum

119894=1

log 119902 (119908119894)

(5)

where 119901 denotes the empirical distribution of the test sample119901(119908119894) = 119899119873 if 119908

119894appeared 119899 times in the test sample

of size 119873 119902(119908119894) is the probability of event 119908

119894estimated

from the training set Thus the perplexity pp can be simplytransformed as

pp = 119887119867(119901119902) (6)

where 119887 is the base with respect to which the cross-entropyis measured (eg bits or nats) 119867(119901 119902) is the cross-entropygiven in (5) which is often applied as a cosmetic substitute ofperplexity for data selection [2 18]

Let 119867119868(119901 119902) and 119867

119874(119901 119902) be the cross-entropy of string

119908119894according to language model LM

119868and LM

119874which are

respectively trained by in-domain dataset 119868 and general-domain dataset 119866 Considering the source (src) and target

(tgt) sides of training data there are three perplexity-basedvariants The first is called basic cross-entropy given by

119867119868-src (119901 119902) (7)

The second one is Moore-Lewis cross-entropy difference[18]

119867119868-src (119901 119902) minus 119867119866-src (119901 119902) (8)

which tries to select the sentences that are more similar to119868 but different to others in 119866 All the above two criteriaonly consider the sentences in source language FurthermoreAxelrod et al [2] proposed a metric that sums cross-entropydifference over both sides

[119867119868-src (119901 119902) minus 119867119866-src (119901 119902)]

+ [119867119868-tgt (119901 119902) minus 119867119866-tgt (119901 119902)]

(9)

The candidates with lower scores (obtained by (7) (8)and (9)) have higher relevance to target domain The size ofthe perplexity-based pseudo in-domain subset SizePP shouldbe equal to SizeCos-IR In practice we perform SRILM toolkit(available at httpwwwspeechsricomprojectssrilm) [25]to conduct 5-gram LMs with interpolated modified Kneser-Ney discounting [26]

33 Edit Distance Given a sentence 119904119866from119866 and a sentence

119904119877from the test set or in-domain corpus the edit distance for

these two sequences is defined as the minimum number ofedits that is symbol insertions deletions and substitutionsneeded to transform 119904

119866into 119904

119877 Based on Levenshtein

distance or edit distance there are several different imple-mentations We used the normalized Levenshtein similarityscore (fuzzy matching score FMS)

FMS = 1 minusLD (119904119866 119904119877)

Max (10038161003816100381610038161199041198661003816100381610038161003816 10038161003816100381610038161199041198771003816100381610038161003816) (10)

where |119904119866| and |119904

119877| are the lengths 119904

119866and 119904119877[20 21] In this

study we employed theword-based Levenshtein edit distancefunction If there is a sentence of which its score exceeds athreshold we would further penalize these sentences accord-ing to the whitespace and punctuations editing differencesThe algorithm is implemented in map reduce to parallelizethe process and shorten the processing time

34 Proposed Model For corpus level combination werespectively perform GIZA++ toolkit (httpwwwstatmtorgmosesgizaGIZA++html) on pseudo in-domain sub-corpora selected by different data selection methods Theneach sentence in different subcorpora can be weighted bymodifying its corresponding occurrences in the GIZA++ file[6] Finally we combine these GIZA++ files together as a newone Formally this combination method can be defined asfollows

120572Cos IR (119878119909 119879119909)

cup120573PPBased (119878119910 119879119910)

cup120582EDBased (119878119911 119879119911)

(11)

The Scientific World Journal 5

where 120572 120573 and 120582 are weights for sentences pairs (119878119909 119879119909)

(119878119910 119879119910) and (119878

119911 119879119911) which are selected by cosine tf-

idf (Cos IR) perplexity-based (PPBased) and edit-distance-based (EDBased)methods respectively Note that the num-ber of sentence occurrence should be integer In practicewe made 120572 120573 and 120582 integer numbers and multiplied theseweights with the occurrences of each corresponding sentencein GIZA++ file

For model level combination we perform linear interpo-lation on the models trained with the subcorpora retrievedby different data selection methods The phrase translationprobability 120601(119905 | 119904) and the lexical weight 119901

119908(119905 | 119904 119886) are

estimated using (12) and (13) respectively as follows

120601 (119905 | 119904) =

119899

sum

119894=0

120572119894120601119894(119905 | 119904) (12)

119901119908(119905 | 119904 119886) =

119899

sum

119894=0

120573119894119901119908119894(119905 | 119904 119886) (13)

where 119894 = 1 2 and 3 denoting the phrase translationprobability and lexical weights trained on the subcorporaretrieved by Cos IR PP Based and EDBased model 119904 and 119905are the phrases in source and target language 119886 is the align-ment information 120572

119894and 120573

119894are the tunable interpolation

parameters subject to sum120572119894= sum120573

119894= 1

In this paper we aremore interested in the comparison ofvarious data selection criteria rather than the best combina-tion method thus we did not tune these weights in the aboveequations and gave them equal weights (At corpus level 120572 =120573 = 120582 = 1 At model level 120572

119894= 120573119894= 13) in experiments

Moreover we will see that even these simple combinationscan lead to an ideal improvement of translation quality

4 Experimental Setup

41 Corpora Two corpora are needed for the domain adap-tation task The general corpus includes more than 1 millionparallel sentences comprising varieties of genres such asnewswires (LDC2005T10) translation example from dictio-naries law statements and sentences from online sourcesThe domain distribution of the general corpus is shownin Table 1 The miscellaneous part includes the sentencescrawled from various materials and the law portion includesthe articles collected from Chinese mainland Hong KongandMacau which have different lawsThe in-domain corpusdevelopment set and test set are randomly selected (that aredisjoined) from the Hong Kong law corpus (LDC2004T08)The size of the test set in-domain corpus and general corpuswe used is summarized in Table 2

In the preprocessing the English texts are tokenizedby Moses scripts (scripts are available at httpwwwstatmtorgeuroparl) [27] and the Chinese texts are seg-mented by NLPIR (available at httpictclasnlpirorg) [28]We removed the sentences of which length is more than 80Furthermore we only used the target side of data in generalcorpus as the general-domain LM training corpus Thus thetarget side of the pseudo in-domain corpus obtained by the

Table 1 Proportions of domains of general corpus

Domain Sent number News 279962 2460Novel 304932 2679Law 48754 428Miscellaneous 504396 4433Total 1138044 10000

previous methods (in Section 3) could be directly used fortraining adapted LM

42 Systems Description The experiments presented in thispaper were carried out with the Moses toolkit [29] a state-of-the-art open-source phrase-based SMT systemThe trans-lation and the reordering model relied on ldquogrow-diag-finalrdquosymmetrized word-to-word alignments built using GIZA++[30] and the training script of Moses The weights of the log-linear model were optimized by means of MERT [31]

A 5-gram language model was trained using the IRSTLMtoolkit [32] exploiting improved modified Kneser-Neysmoothing and quantizing both probabilities and back-offweights

43 Settings For the comparison totally five existing rep-resentative data selection models three baseline systemsand the proposed model were selected The correspondingsettings of the above models are as follows

(i) Baseline the baseline systems were trained withthe toolkits and settings as described in Section 42The in-domain baseline (IC-baseline) and general-domain baseline (GC-baseline) were respectivelytrained on in-domain corpus and general corpusThen a combined baseline system (GI-baseline) wascreated by passing the above two phrase tables to thedecoder and using them in parallel

(ii) Individualmodel as described in Sections 31 32 and33 the individual models are cosine tf-idf (Cos-IR)and fuzzy matching scorer (FMS) which is an edit-distance-based (ED-Based) instance as well as threeperplexity-based (PP-Based) variants cross-entropy(CE) cross-entropy difference (CED) and bilingualcross-entropy difference (B-CED)

(iii) Proposed model as described in Section 34 wecombined Cos-IR PP-Based and ED-Basedmethodsat corpus level (named iTPB-C) and model level(named iTPB-M)

The test set development set [6 15] and in-domaincorpus [2 18] can be used to select data from general-domaincorpus The first strategy is under the assumption that theinput data is known before building the models But in areal case we do not exactly know the test data in advanceTherefore we use the development set (dev strategy) andadditional in-domain corpus (in-domain strategy) which areidentical to the target domain to select data for TM and LMadaptation

6 The Scientific World Journal

Table 2 Corpora statistics

Data Set Lang Sentences Tokens Av len

Test set EN 2050 60399 2946ZH 59628 2909

Dev set EN 2000 59732 2926ZH 2000 59064 2907

In-domain EN 43621 1330464 2916ZH 1321655 2897

Training set EN 1138044 28626367 2515ZH 28239747 2481

5 Results and Discussions

For eachmethodwe used the119873-bests selection results where119873 = 20K 40K 80K 160K 320K 640K sentence pairs outof the 11M from the general corpus (roughly 175 3570 140 280 and 56 of general-domain corpus where119896 is short for thousand and 119898 is short for million) In thissection we firstly discuss three PPBased variants and thencompare the best one with other two models Cos-IR andFMS After that combined model and the best individual arecompared Finally the performance is further improved by asimple but effective system combination method

51 Baselines The baseline results (in Table 3) show that atranslation system trained on the general corpus outperformsthe one trained on the in-domain corpus over 285 BLEUpoints The main reason is that general corpus is so broadthat the OOVs aremuch lesser Combining these two systemscould further increase the GC-baseline by 191 points

52 IndividualModel Thecurves in Figure 2 show that all thethree PP Based variants that is CE CED and B-CED couldbe used to train domain-adapted SMT systems Using lesstraining data they still obtain better performance than GC-baseline CE is the simplest PP Based one and improves theGC-baseline by at most 164 points when selecting 320K outof 11M sentences Although CED gives bias to the sentencesthat are more similar to the target domain its performanceseems unstable with dev strategy and the worst with in-domain strategy This indicates that it is not suitable torandomly select data from general corpus for training LM

119874

(in Section 32) Similar to the conclusion in [2] B-CEDachieves the highest improvement with least selected dataamong the three variants It proves that bilingual resourcesare helpful to balance OOVs and noise However when usingin-domain corpus to select pseudo in-domain subcorporatheir trends are similar but worse They have to enlarge thesize of selected data to perform nearly as well as the resultsunder dev strategy Therefore PP Based approaches may notwork well for a real-life SMT application

Next we use B-CED (the best one among PPBased vari-ants) to compare with other selection criteria As shown inFigure 3 Cos-IR improves by at most 102 (dev) and 088 (in-domain) BLEU points using 28 data of the general corpusThe results approximately match with the conclusions given

Table 3 BLEU via general-domain and in-domain corpus

Baseline Sentences BLEUGC-baseline 11M 3915IC-baseline 45K 3630GI-baseline 4106

Table 4 Translation results of iTPB at different combination levels

Methods Sent BLEU(dev set)

BLEU(in-domain set)

B-CED80K 4091 3550160K 4112 3947320K 4002 4098

iTPB-C80K 4225 3939160K 4304 4187320K 4242 4044

iTPB-M80K 4293 4057160K 4365 4195320K 4397 4221

by [6 15] showing that keywords overlap (in Figure 1) playsa significant role in retrieving sentences in similar domainsHowever it still needs a large amount of selected data (morethan 280) to obtain an ideal performanceThemain reasonmay be that the sentences including same keywords still maybe irrelevant For example two sentences share the samephrase ldquoaccording to the articlerdquo but one sentence is in thelegal domain and the other is from news Figure 3 also showsthat FMS fails to outperform theGC-baseline system even it ismuch stricter than other criteriaWhen addingword positionfactor into similarity measuring FMS tries to find highlysimilar sentences on length collocation and even semanticsBut it seems that our general corpus is not large enoughto cover a certain amount of FMS-similar sentences Withthe size of general or in-domain corpus increases it maybenefit the translation quality because FMS still works betterthan IC-baseline which proves its positive impact on filteringnoise

Among the three presented criteria PP Based can achievethe highest BLEU with considering an appropriate amountof factors for similarity measuring However the curvesshow that it depends heavily upon the threshold 120579 in (2)

Figure 2: BLEU scores via perplexity-based data selection methods (CE, CED, B-CED, and GC-base; x-axis: number of selected sentences in K; y-axis: BLEU) with dev (a) and in-domain (b) strategies.

Figure 3: BLEU scores via different data selection methods (Cos-IR, B-CED, FMS, and GC-base; x-axis: number of selected sentences in K; y-axis: BLEU) with dev (a) and in-domain (b) strategies.

Selecting more or less pseudo in-domain data than the optimum causes performance to drop sharply. In contrast, Cos-IR works steadily and robustly under either strategy, but its improvements are modest. Thus, no single individual model performs well in terms of both effectiveness and robustness.

5.3. Combined Model. From Figure 3, we found that each individual model peaks between 80K and 320K. Thus, we only selected the top N = 80K, 160K, and 320K sentence pairs for further comparison. We combined Cos-IR, FMS, and B-CED, assigning equal weights to each individual model at both the corpus and model levels (as described in Section 3.4). The translation qualities via iTPB are shown in Table 4.
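As a minimal sketch of the model-level combination described in Section 3.4 (equations (12) and (13)), the following code linearly interpolates the phrase translation probabilities of several phrase tables with equal weights. The dictionary-based phrase-table representation and the entry format are our own simplification, not the Moses data structures; the lexical weights of equation (13) would be interpolated in the same way.

```python
def interpolate_phrase_tables(tables, weights=None):
    """Model-level combination: phi(t|s) = sum_i alpha_i * phi_i(t|s).
    Each table is a dict mapping (source_phrase, target_phrase) -> probability."""
    if weights is None:
        weights = [1.0 / len(tables)] * len(tables)   # equal weights, as in the paper
    combined = {}
    for table, alpha in zip(tables, weights):
        for pair, prob in table.items():
            combined[pair] = combined.get(pair, 0.0) + alpha * prob
    return combined

# Toy usage with three hypothetical single-criterion phrase tables.
cos_ir   = {("条例", "ordinance"): 0.6}
pp_based = {("条例", "ordinance"): 0.8}
ed_based = {("条例", "regulation"): 0.5}
print(interpolate_phrase_tables([cos_ir, pp_based, ed_based]))
```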

At both levels, iTPB performs much better than any single individual model as well as the GC-baseline system. For instance, iTPB-C achieves at most 3.89 (dev) and 2.72 (in-domain) points of improvement over the baseline system. The result is also higher than the best individual model (B-CED) by 1.92 (dev) and 0.91 (in-domain) points. This shows a strong ability to balance OOVs and noise: on the one hand, filtering out too many unmatched words may not sufficiently address the data sparsity issue of the SMT model; on the other hand, adding too much of the selected data may dilute the in-domain characteristics of the SMT model. The combinations, however, seem to inherit the pros and reduce the cons of the individual models. In addition, the performance of iTPB does not drop sharply when the threshold in (2) or the selection strategy is changed.


Table 5: Results of combination models.

Methods       Sent   BLEU (dev set)   BLEU (in-domain set)
GI-baseline   —      41.06
IC-baseline   45K    36.30
B-CED+I       80K    41.05            35.80
B-CED+I       160K   41.14            39.54
B-CED+I       320K   41.67            40.67
iTPB-C+I      80K    42.85            41.45
iTPB-C+I      160K   43.48            41.88
iTPB-C+I      320K   43.63            41.98
iTPB-M+I      80K    43.33            41.83
iTPB-M+I      160K   44.13            42.43
iTPB-M+I      320K   44.37            42.53

This shows that combined models are both stable and robust enough to train a state-of-the-art SMT system. We also show that model-level combination works better (by up to +1 BLEU) than corpus-level combination under the same settings.

5.4. System Combination. In order to make full use of all available data, we also combine the in-domain TM with the pseudo in-domain TM to further improve translation quality, using the multiple-decoding-paths technique described in Section 2.2.

Table 5 shows that the translation system trained on a pseudo in-domain subset selected with an individual or combined model can be further improved by combining it with an in-domain TM. The iTPB-M+I system, comprising two phrase tables trained on the iTPB-M based pseudo in-domain subset and the in-domain corpus, still works best. This combined system is at most 3.31 (dev) and 1.47 (in-domain) points better than the GI-baseline system, and 8.07 (dev) and 6.23 (in-domain) points better than the in-domain system alone.
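Conceptually, the multiple-decoding-paths combination lets translation options from both phrase tables compete during decoding instead of merging the tables offline. The toy sketch below is our own illustration, not the Moses implementation: it simply pools the candidate options a decoder would consider for one source phrase from the in-domain and pseudo in-domain tables, each option keeping its own scores and origin.

```python
def translation_options(source_phrase, in_domain_table, pseudo_in_domain_table):
    """Collect candidate translations from both phrase tables (multiple decoding
    paths): options are pooled and tagged with their origin rather than merged,
    so the decoder's log-linear model can weigh them against each other."""
    options = []
    for origin, table in (("in-domain", in_domain_table),
                          ("pseudo-in-domain", pseudo_in_domain_table)):
        for target, prob in table.get(source_phrase, {}).items():
            options.append({"target": target, "phi": prob, "origin": origin})
    return options

# Hypothetical phrase tables for one source phrase.
in_dom = {"条例": {"ordinance": 0.72}}
pseudo = {"条例": {"ordinance": 0.55, "regulation": 0.30}}
print(translation_options("条例", in_dom, pseudo))
```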

6. Conclusions

In this paper, we analyze the impact of different data selection criteria on SMT domain adaptation. To the best of our knowledge, this is the first work to systematically compare state-of-the-art data selection methods such as cosine tf-idf and perplexity. Beyond the revisited methods, we introduce edit distance as a new similarity metric for this task. This in-depth analysis should be valuable to other researchers working on data selection.

Based on this investigation, a combination approach that makes use of the individual models is proposed and extensively evaluated. The combination is conducted not only at the corpus level but also at the model level, where domain-adapted systems trained via different methods are interpolated to facilitate translation. Five individual methods (Cos-IR, CE, CED, B-CED, and FMS), three baseline systems (GC-baseline, IC-baseline, and GI-baseline), and the proposed method are evaluated on a large general corpus. Finally, we further combined the best domain-adapted system with the TM trained on the in-domain corpus to maximize translation quality. Empirical results reveal that the proposed model achieves good performance in terms of both robustness and effectiveness.

We analyze the results from three different aspects:

(i) Translation quality: the results show significant gains for most methods, especially for iTPB. Under the current dataset sizes, considering more factors in similarity measuring may not benefit translation quality.

(ii) Noise and OOVs: balancing them is a big challenge for any single data selection model. However, bilingual resources and combination methods are helpful in dealing with this problem.

(iii) Robustness and effectiveness: a real-life system should achieve robust and effective performance with the in-domain strategy. Only iTPB obtained a consistently boosted performance.

Finally, we can draw a composite conclusion (a > b means that a is better than b):

iTPB > PP-based > Cos-IR > Baseline > FMS. (14)

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors are grateful to the Science and Technology Development Fund of Macau and the Research Committee of the University of Macau for the funding support for their research under Reference nos. MYRG076 (Y1-L2)-FST13-WF and MYRG070 (Y1-L2)-FST12-CS.


References

[1] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer, "The mathematics of statistical machine translation: parameter estimation," Computational Linguistics, vol. 19, no. 2, pp. 263–311, 1993.
[2] A. Axelrod, X. He, and J. Gao, "Domain adaptation via pseudo in-domain data selection," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11), pp. 355–362, July 2011.
[3] P. Koehn, F. J. Och, and D. Marcu, "Statistical phrase-based translation," in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL '03), vol. 1, pp. 48–54, 2003.
[4] P. Koehn, A. Axelrod, A. B. Mayne, C. Callison-Burch, M. Osborne, and D. Talbot, "Edinburgh system description for the 2005 IWSLT speech translation evaluation," in Proceedings of the International Workshop on Spoken Language Translation (IWSLT '05), pp. 68–75, 2005.
[5] P. Koehn, "Pharaoh: a beam search decoder for phrase-based statistical machine translation models," in Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA '04), pp. 115–124, Springer, Berlin, Germany, 2004.
[6] Y. Lu, J. Huang, and Q. Liu, "Improving statistical machine translation performance by training data selection and optimization," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '07), pp. 343–350, June 2007.
[7] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), pp. 311–318, Philadelphia, Pa, USA, 2002.
[8] H. Daume III and J. Jagarlamudi, "Domain adaptation for machine translation by mining unseen words," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT '11), pp. 407–412, June 2011.
[9] S. Mansour and H. Ney, "A simple and effective weighted phrase extraction for machine translation adaptation," in Proceedings of the 9th International Workshop on Spoken Language Translation (IWSLT '12), pp. 193–200, 2012.
[10] P. Koehn and B. Haddow, "Towards effective use of training data in statistical machine translation," in Proceedings of the 7th ACL Workshop on Statistical Machine Translation, pp. 317–321, 2012.
[11] J. Civera and A. Juan, "Domain adaptation in statistical machine translation with mixture modeling," in Proceedings of the 2nd ACL Workshop on Statistical Machine Translation, pp. 177–180, 2007.
[12] G. Foster and R. Kuhn, "Mixture-model adaptation for SMT," in Proceedings of the 2nd ACL Workshop on Statistical Machine Translation, pp. 128–136, 2007.
[13] V. Eidelman, J. Boyd-Graber, and P. Resnik, "Topic models for dynamic translation model adaptation," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers (ACL '12), vol. 2, pp. 115–119, 2012.
[14] S. Matsoukas, A. V. I. Rosti, and B. Zhang, "Discriminative corpus weight estimation for machine translation," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '09), vol. 2, pp. 708–717, August 2009.
[15] A. S. Hildebrand, M. Eck, S. Vogel, and A. Waibel, "Adaptation of the translation model for statistical machine translation based on information retrieval," in Proceedings of the 10th Annual Conference of the European Association for Machine Translation (EAMT '05), pp. 133–142, May 2005.
[16] S. C. Lin, C. L. Tsai, L. F. Chien, K. J. Chen, and L. S. Lee, "Chinese language model adaptation based on document classification and multiple domain-specific language models," in Proceedings of the 5th European Conference on Speech Communication and Technology, 1997.
[17] J. Gao, J. Goodman, M. Li, and K. F. Lee, "Toward a unified approach to statistical language modeling for Chinese," ACM Transactions on Asian Language Information Processing (TALIP), vol. 1, pp. 3–33, 2002.
[18] R. C. Moore and W. Lewis, "Intelligent selection of language model training data," in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL '10), pp. 220–224, July 2010.
[19] V. I. Levenshtein, "Binary codes capable of correcting deletions, insertions and reversals," Soviet Physics Doklady, vol. 10, p. 707, 1966.
[20] P. Koehn and J. Senellart, "Convergence of translation memory and statistical machine translation," in Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pp. 21–31, 2010.
[21] J. Leveling, D. Ganguly, S. Dandapat, and G. J. F. Jones, "Approximate sentence retrieval for scalable and efficient example-based machine translation," in Proceedings of the 24th International Conference on Computational Linguistics (COLING '12), pp. 1571–1586, 2012.
[22] T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning, vol. 1, Springer, New York, NY, USA, 2001.
[23] H. Wu and H. Wang, "Pivot language approach for phrase-based statistical machine translation," Machine Translation, vol. 21, no. 3, pp. 165–181, 2007.
[24] P. Koehn and J. Schroeder, "Experiments in domain adaptation for statistical machine translation," in Proceedings of the 2nd ACL Workshop on Statistical Machine Translation, pp. 224–227, 2007.
[25] A. Stolcke, "SRILM: an extensible language modeling toolkit," in Proceedings of the International Conference on Spoken Language Processing, pp. 901–904, 2002.
[26] S. F. Chen and J. Goodman, "An empirical study of smoothing techniques for language modeling," in Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL '96), pp. 310–318, 1996.
[27] P. Koehn, "Europarl: a parallel corpus for statistical machine translation," in Proceedings of the 10th Machine Translation Summit (AAMT '05), vol. 5, pp. 79–86, Phuket, Thailand, 2005.
[28] H. P. Zhang, H. K. Yu, D. Y. Xiong, and Q. Liu, "HHMM-based Chinese lexical analyzer ICTCLAS," in Proceedings of the 2nd SIGHAN Workshop on Chinese Language Processing, vol. 17, pp. 184–187, 2003.
[29] P. Koehn, H. Hoang, A. Birch, et al., "Moses: open source toolkit for statistical machine translation," in Proceedings of the 45th Annual Meeting of the ACL, Interactive Poster and Demonstration Sessions (ACL '07), pp. 177–180, 2007.
[30] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, no. 1, pp. 19–51, 2003.
[31] F. J. Och, "Minimum error rate training in statistical machine translation," in Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL '03), vol. 1, pp. 160–167, 2003.
[32] M. Federico, N. Bertoldi, and M. Cettolo, "IRSTLM: an open source toolkit for handling large scale language models," in Proceedings of the 9th Annual Conference of the International Speech Communication Association (INTERSPEECH '08), pp. 1618–1621, September 2008.



5 Results and Discussions

For eachmethodwe used the119873-bests selection results where119873 = 20K 40K 80K 160K 320K 640K sentence pairs outof the 11M from the general corpus (roughly 175 3570 140 280 and 56 of general-domain corpus where119896 is short for thousand and 119898 is short for million) In thissection we firstly discuss three PPBased variants and thencompare the best one with other two models Cos-IR andFMS After that combined model and the best individual arecompared Finally the performance is further improved by asimple but effective system combination method

51 Baselines The baseline results (in Table 3) show that atranslation system trained on the general corpus outperformsthe one trained on the in-domain corpus over 285 BLEUpoints The main reason is that general corpus is so broadthat the OOVs aremuch lesser Combining these two systemscould further increase the GC-baseline by 191 points

52 IndividualModel Thecurves in Figure 2 show that all thethree PP Based variants that is CE CED and B-CED couldbe used to train domain-adapted SMT systems Using lesstraining data they still obtain better performance than GC-baseline CE is the simplest PP Based one and improves theGC-baseline by at most 164 points when selecting 320K outof 11M sentences Although CED gives bias to the sentencesthat are more similar to the target domain its performanceseems unstable with dev strategy and the worst with in-domain strategy This indicates that it is not suitable torandomly select data from general corpus for training LM

119874

(in Section 32) Similar to the conclusion in [2] B-CEDachieves the highest improvement with least selected dataamong the three variants It proves that bilingual resourcesare helpful to balance OOVs and noise However when usingin-domain corpus to select pseudo in-domain subcorporatheir trends are similar but worse They have to enlarge thesize of selected data to perform nearly as well as the resultsunder dev strategy Therefore PP Based approaches may notwork well for a real-life SMT application

Next we use B-CED (the best one among PPBased vari-ants) to compare with other selection criteria As shown inFigure 3 Cos-IR improves by at most 102 (dev) and 088 (in-domain) BLEU points using 28 data of the general corpusThe results approximately match with the conclusions given

Table 3 BLEU via general-domain and in-domain corpus

Baseline Sentences BLEUGC-baseline 11M 3915IC-baseline 45K 3630GI-baseline 4106

Table 4 Translation results of iTPB at different combination levels

Methods Sent BLEU(dev set)

BLEU(in-domain set)

B-CED80K 4091 3550160K 4112 3947320K 4002 4098

iTPB-C80K 4225 3939160K 4304 4187320K 4242 4044

iTPB-M80K 4293 4057160K 4365 4195320K 4397 4221

by [6 15] showing that keywords overlap (in Figure 1) playsa significant role in retrieving sentences in similar domainsHowever it still needs a large amount of selected data (morethan 280) to obtain an ideal performanceThemain reasonmay be that the sentences including same keywords still maybe irrelevant For example two sentences share the samephrase ldquoaccording to the articlerdquo but one sentence is in thelegal domain and the other is from news Figure 3 also showsthat FMS fails to outperform theGC-baseline system even it ismuch stricter than other criteriaWhen addingword positionfactor into similarity measuring FMS tries to find highlysimilar sentences on length collocation and even semanticsBut it seems that our general corpus is not large enoughto cover a certain amount of FMS-similar sentences Withthe size of general or in-domain corpus increases it maybenefit the translation quality because FMS still works betterthan IC-baseline which proves its positive impact on filteringnoise

Among the three presented criteria PP Based can achievethe highest BLEU with considering an appropriate amountof factors for similarity measuring However the curvesshow that it depends heavily upon the threshold 120579 in (2)

The Scientific World Journal 7

0 150 300 450 600 750 900 1050330

345

360

375

390

405

BLEU

CECED

B-CEDGC-base

The numbers of selected sentences (k)

(a)

0 150 300 450 600 750 900 1050180195210225240255270285300315330345360375390405

BLEU

CECED

B-CEDGC-base

The numbers of selected sentences (k)

(b)

Figure 2 BLEU scores via perplexity-based data selection methods with dev (a) and in-domain (b) strategies

0 150 300 450 600 750 900 1050

345

360

375

390

405

420

BLEU

Cos-IRB-CED

FMSGC-base

The numbers of selected sentences (k)

(a)

0 150 300 450 600 750 900 1050

240255270285300315330345360375390405420

BLEU

Cos-IRB-CED

FMSGC-base

The numbers of selected sentences (k)

(b)

Figure 3 BLEU scores via different data selection methods with dev (a) and in-domain (b) strategies

Selecting more or less pseudo in-domain data will lead tothe performance dropping sharply Instead Cos-IR workssteadily and robustly with either and both strategies but itsimprovements are not clearThus any single individualmodelcannot perform well on both effectiveness and robustness

53 Combined Model From Figure 3 we found that eachindividual model peaks between 80K and 320K Thus weonly selected the top 119873 = 80K 160K 320K for furthercomparisonWe combinedCos-IR and FMS aswell as B-CEDand assigned equal weights to each individual model at bothcorpus and model levels (as described in Section 34) Thetranslation qualities via iTPB are shown in Table 4

At both levels iTPBperformsmuch better than any singleindividual model as well as GC-baseline system For instanceiTPB-C has achieved at most 389 (dev) and 272 (in-domain)improvements than the baseline system Also the result isstill higher than the best individual model (B-CED) by 192(dev) and 091 (in-domain) This shows a strong ability tobalance OOV and noise On the one hand filtering toomuch unmatched words may not sufficiently address the datasparsity issue of the SMT model on the other hand addingtoo much of the selected data may lead to the dilution ofthe in-domain characteristics of the SMT model Howevercombinations seem to succeed the pros and reduce the consof the individualmodel In addition the performance of iTPBdoes not drop sharply when changing the threshold in (2)

8 The Scientific World Journal

Table 5 Results of combination models

Methods Sent BLEU(dev set)

BLEU(in-domain set)

GI-baseline 4106IC-baseline 45K 3630

B-CED+I80K 4105 3580160K 4114 3954320K 4167 4067

iTPB-C+I80K 4285 4145160K 4348 4188320K 4363 4198

iTPB-M+I80K 4333 4183160K 4413 4243320K 4437 4253

or the strategies This shows that combined models are bothstable and robust to train a state-of-the-art SMT system Wealso prove that model level combination works better (upto +1 BLEU) than corpus level combination with the samesettings

54 System Combination In order to advantageously useall available data we also combine the in-domain TM withpseudo in-domain TM to further improve the translationquality We used multiple paths decoding technique asdescribed in Section 22

Table 5 shows that the translation system trained on apseudo in-domain subset selected with individual or com-bined models can be further improved by combining withan in-domain TM The iTPB-M+I system comprising twophrase tables trained on iTPB-M based pseudo in-domainsubset and in-domain corpus still works best This tiny com-bined system is at most 331+ (dev) and 147+ (in-domain)points better than the GI-baseline system 807+ (dev) and623+ (in-domain) points better than the in-domain systemalone

6 Conclusions

In this paper we analyze the impacts of different dataselection criteria on SMT domain adaptation This is the firsttime to systematically compare the state-of-the-art data selec-tion methods such as cosine tf-idf and perplexity Amongthe revisited methods we consider edit distance as a newsimilarity metric for this task The in-depth analysis can bevery valuable to other researchers working on data selection

Based on the investigation a combination approach tomake use of those individual models is extensively proposedand evaluated The combination is conducted at not only thecorpora level but also themodels level where domain-adaptedsystems trained via different methods are interpolated tofacilitate the translation Five individual methods Cos-IRCE CED B-CED and FMS three baseline systems and GC-baseline IC-baseline andGI-baseline aswell as the proposedmethod are evaluated on a large general corpus Finally we

further combined the best domain-adapted system with theTM trained on in-domain corpus tomaximize the translationquality Empirical results reveal that the proposed modelachieves a good performance in terms of robustness andeffectiveness

We analyze the results from three different aspects

(i) Translation quality the results show a significantperformance of themostmethods especially for iTPBUnder the current size of datasets considering morefactors in similarity measuring may not benefit thetranslation quality

(ii) Noise and OOVs it is a big challenge to balance themfor single individual data selection model Howeverbilingual resources and combination methods arehelpful to deal with this problem

(iii) Robustness and effectiveness a real-life system shouldachieve a robust and effective performance with in-domain strategy Only iTPB obtained a consistentlyboosting performance

Finally we can draw a composite conclusion that (119886 gt 119887

means that 119886 is better than 119887)

119894TPB gt PPBased gt Cos-IR gt Baseline gt FMS (14)

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors are grateful to the Science and TechnologyDevelopment Fund of Macau and the Research Committeeof the University of Macau for the funding support for theirresearch under theReference nosMYRG076 (Y1-L2)-FST13-WF and MYRG070 (Y1-L2)-FST12-CS

The Scientific World Journal 9

References

[1] P F Brown V J D Pietra S A D Pietra and R L MercerldquoThe mathematics of statistical machine translation parameterestimationrdquo Computational Linguistics vol 19 no 2 pp 263ndash311 1993

[2] A Axelrod X He and J Gao ldquoDomain adaptation via pseudoin-domain data selectionrdquo in Proceedings of the Conference onEmpiricalMethods in Natural Language Processing (EMNLP rsquo11)pp 355ndash362 July 2011

[3] P Koehn F J Och and D Marcu ldquoStatistical phrase-basedtranslationrdquo in Proceedings of the Conference of the North Amer-ican Chapter of the Association for Computational Linguistics onHuman Language Technology (NAACL rsquo03) vol 1 pp 48ndash542003

[4] P Koehn A Axelrod A B Mayne C Callison-Burch MOsborne and D Talbot ldquoEdinburgh system description forthe 2005 IWSLT speech translation evaluationrdquo in Proceedingsof the International Workshop on Spoken Language Translation(IWSLT rsquo05) pp 68ndash75 2005

[5] P Koehn ldquoPharaoh a beam search decoder for phrase-basedstatistical machine translation modelsrdquo in Proceedings of theAntenna Measurement Techniques Association (AMTA rsquo04) pp115ndash124 Springer Berlin Germany 2004

[6] Y Lu J Huang and Q Liu ldquoImproving statistical machinetranslation performance by training data selection and opti-mizationrdquo in Proceedings of the Joint Conference on EmpiricalMethods in Natural Language Processing and ComputationalNatural Language Learning (EMNLP-CoNLL rsquo07) pp 343ndash350June 2007

[7] K Papineni S Roukos T Ward and W J Zhu ldquoBLEU amethod for automatic evaluation of machine translationrdquo inProceedings of theWorkshop on Automatic Summarization (ACLrsquo02) pp 311ndash318 Philadephia Pa USA 2002

[8] H Daume III and J Jagarlamudi ldquoDomain adaptation formachine translation by mining unseen wordsrdquo in Proceedingsof the 49th Annual Meeting of the Association for ComputationalLinguistics Human Language Technologies (ACL-HLT rsquo11) pp407ndash412 June 2011

[9] SMansour andH Ney ldquoA simple and effective weighted phraseextraction formachine translation adaptationrdquo inProceedings ofthe 9th International Workshop on Spoken Language Translation(IWSLT rsquo12) pp 193ndash200 2012

[10] P Koehn and BHaddow ldquoTowards effective use of training datain statistical machine translationrdquo in Proceedings of the 7th ACLWorkshop on Statistical Machine Translation pp 317ndash321 2012

[11] J Civera andA Juan ldquoDomain adaptation in statisticalmachinetranslation with mixture modelingrdquo in Proceedings of the 2ndACL Workshop on Statistical Machine Translation pp 177ndash1802007

[12] G Foster and R Kuhn ldquoMixture-model adaptation for SMTrdquoin Proceedings of the 2nd ACL Workshop on Statistical MachineTranslation pp 128ndash136 2007

[13] V Eidelman J Boyd-Graber and P Resnik ldquoTopic modelsfor dynamic translation model adaptationrdquo in Proceedings ofthe 50th Annual Meeting of the Association for ComputationalLinguistics Short Papers (ACL rsquo12) vol 2 pp 115ndash119 2012

[14] S Matsoukas A V I Rosti and B Zhang ldquoDiscriminative cor-pus weight estimation for machine translationrdquo in Proceedingsof the Conference on Empirical Methods in Natural LanguageProcessing (EMNLP rsquo09) vol 2 pp 708ndash717 August 2009

[15] A S Hildebrand M Eck S Vogel and A Waibel ldquoAdaptationof the translation model for statistical machine translationbased on information retrievalrdquo in Proceedings of the 10thAnnual Conference on European Association for Machine Trans-lation (EAMT rsquo05) pp 133ndash142 May 2005

[16] S C Lin C L Tsai L F Chien K J Chen and L SLee ldquoChinese language model adaptation based on documentclassification and multiple domain-specific language modelsrdquoin Proceedings of the 5th European Conference on Speech Com-munication and Technology 1997

[17] J Gao J Goodman M Li and K F Lee ldquoToward a uni-fied approach to statistical language modeling for Chineserdquoin Proceedings of the ACM Transactions on Asian LanguageInformation Processing (TALIP rsquo02) vol 1 pp 3ndash33 2002

[18] R C Moore and W Lewis ldquoIntelligent selection of languagemodel training datardquo in Proceedings of the 48th Annual Meetingof the Association for Computational Linguistics (ACL rsquo10) pp220ndash224 July 2010

[19] V I Levenshtein ldquoBinary codes capable of correcting deletionsinsertions and reversalsrdquo Soviet Physics Doklady vol 10 p 7071966

[20] P Koehn and J Senellart ldquoConvergence of translation memoryand statistical machine translationrdquo in Proceedings of AMTAWorkshop on MT Research and the Translation Industry pp 21ndash31 2010

[21] J Leveling D Ganguly S Dandapat andG J F Jones ldquoApprox-imate sentence retrieval for scalable and efficient example-basedmachine translationrdquo in Proceedings of the 24th InternationalConference on Computational Linguistics (COLING rsquo12) pp1571ndash1586 2012

[22] T Hastie R Tibshirani and J J H Friedman The Elements ofStatistical Learning vol 1 Springer New York NY USA 2001

[23] HWuandHWang ldquoPivot language approach for phrase-basedstatistical machine translationrdquoMachine Translation vol 21 no3 pp 165ndash181 2007

[24] P Koehn and J Schroeder ldquoExperiments in domain adaptationfor statistical machine translationrdquo in Proceedings of the 2ndACL Workshop on Statistical Machine Translation pp 224ndash2272007

[25] A Stolcke ldquoSRILMmdashan extensible language modeling toolkitrdquoin Proceedings of the International Conference on Spoken Lan-guage Processing pp 901ndash904 2002

[26] S F Chen and J Goodman ldquoAn empirical study of smoothingtechniques for language modelingrdquo in Proceedings of the 34thAnnual Meeting on Association for Computational Linguistics(ACL rsquo96) pp 310ndash318 1996

[27] P Koehn ldquoEuroparl a parallel corpus for statistical machinetranslationrdquo in Proceedings of the 10th Conference of MachineTranslation Summit (AAMT rsquo05) vol 5 pp 79ndash86 PhuketThailand 2005

[28] H P Zhang H K Yu D Y Xiong and Q Liu ldquoHHMM-basedChinese lexical analyzer ICTCLASrdquo in Proceedings of the 2ndSIGHANWorkshop on Chinese Language Processing vol 17 pp184ndash187 2003

[29] P Koehn H Hoang A Birch et al ldquoMoses open sourcetoolkit for statistical machine translationrdquo in Proceedings ofthe 45th Annual Meeting of the ACL on Interactive Poster andDemonstration Sessions (ACL rsquo07) pp 177ndash180 2007

[30] F J Och and H Ney ldquoA systematic comparison of variousstatistical alignmentmodelsrdquoComputational Linguistics vol 29no 1 pp 19ndash51 2003

10 The Scientific World Journal

[31] F J Och ldquoMinimum error rate training in statistical machinetranslationrdquo in Proceedings of the 41st Annual Meeting onAssociation for Computational Linguistics (ACL rsquo03) vol 1 pp160ndash167 2003

[32] M Federico N Bertoldi and M Cettolo ldquoIRSTLM an opensource toolkit for handling large scale language modelsrdquo inProceedings of the 9th Annual Conference of the InternationalSpeech Communication Association (INTERSPEECH rsquo08) pp1618ndash1621 September 2008

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 5: Research Article A Systematic Comparison of Data Selection …downloads.hindawi.com/journals/tswj/2014/745485.pdf · 2019-07-31 · Research Article A Systematic Comparison of Data

The Scientific World Journal 5

where 120572 120573 and 120582 are weights for sentences pairs (119878119909 119879119909)

(119878119910 119879119910) and (119878

119911 119879119911) which are selected by cosine tf-

idf (Cos IR) perplexity-based (PPBased) and edit-distance-based (EDBased)methods respectively Note that the num-ber of sentence occurrence should be integer In practicewe made 120572 120573 and 120582 integer numbers and multiplied theseweights with the occurrences of each corresponding sentencein GIZA++ file

For model level combination we perform linear interpo-lation on the models trained with the subcorpora retrievedby different data selection methods The phrase translationprobability 120601(119905 | 119904) and the lexical weight 119901

119908(119905 | 119904 119886) are

estimated using (12) and (13) respectively as follows

120601 (119905 | 119904) =

119899

sum

119894=0

120572119894120601119894(119905 | 119904) (12)

119901119908(119905 | 119904 119886) =

119899

sum

119894=0

120573119894119901119908119894(119905 | 119904 119886) (13)

where 119894 = 1 2 and 3 denoting the phrase translationprobability and lexical weights trained on the subcorporaretrieved by Cos IR PP Based and EDBased model 119904 and 119905are the phrases in source and target language 119886 is the align-ment information 120572

119894and 120573

119894are the tunable interpolation

parameters subject to sum120572119894= sum120573

119894= 1

In this paper we aremore interested in the comparison ofvarious data selection criteria rather than the best combina-tion method thus we did not tune these weights in the aboveequations and gave them equal weights (At corpus level 120572 =120573 = 120582 = 1 At model level 120572

119894= 120573119894= 13) in experiments

Moreover we will see that even these simple combinationscan lead to an ideal improvement of translation quality

4 Experimental Setup

41 Corpora Two corpora are needed for the domain adap-tation task The general corpus includes more than 1 millionparallel sentences comprising varieties of genres such asnewswires (LDC2005T10) translation example from dictio-naries law statements and sentences from online sourcesThe domain distribution of the general corpus is shownin Table 1 The miscellaneous part includes the sentencescrawled from various materials and the law portion includesthe articles collected from Chinese mainland Hong KongandMacau which have different lawsThe in-domain corpusdevelopment set and test set are randomly selected (that aredisjoined) from the Hong Kong law corpus (LDC2004T08)The size of the test set in-domain corpus and general corpuswe used is summarized in Table 2

In the preprocessing the English texts are tokenizedby Moses scripts (scripts are available at httpwwwstatmtorgeuroparl) [27] and the Chinese texts are seg-mented by NLPIR (available at httpictclasnlpirorg) [28]We removed the sentences of which length is more than 80Furthermore we only used the target side of data in generalcorpus as the general-domain LM training corpus Thus thetarget side of the pseudo in-domain corpus obtained by the

Table 1 Proportions of domains of general corpus

Domain Sent number News 279962 2460Novel 304932 2679Law 48754 428Miscellaneous 504396 4433Total 1138044 10000

previous methods (in Section 3) could be directly used fortraining adapted LM

42 Systems Description The experiments presented in thispaper were carried out with the Moses toolkit [29] a state-of-the-art open-source phrase-based SMT systemThe trans-lation and the reordering model relied on ldquogrow-diag-finalrdquosymmetrized word-to-word alignments built using GIZA++[30] and the training script of Moses The weights of the log-linear model were optimized by means of MERT [31]

A 5-gram language model was trained using the IRSTLMtoolkit [32] exploiting improved modified Kneser-Neysmoothing and quantizing both probabilities and back-offweights

43 Settings For the comparison totally five existing rep-resentative data selection models three baseline systemsand the proposed model were selected The correspondingsettings of the above models are as follows

(i) Baseline the baseline systems were trained withthe toolkits and settings as described in Section 42The in-domain baseline (IC-baseline) and general-domain baseline (GC-baseline) were respectivelytrained on in-domain corpus and general corpusThen a combined baseline system (GI-baseline) wascreated by passing the above two phrase tables to thedecoder and using them in parallel

(ii) Individualmodel as described in Sections 31 32 and33 the individual models are cosine tf-idf (Cos-IR)and fuzzy matching scorer (FMS) which is an edit-distance-based (ED-Based) instance as well as threeperplexity-based (PP-Based) variants cross-entropy(CE) cross-entropy difference (CED) and bilingualcross-entropy difference (B-CED)

(iii) Proposed model as described in Section 34 wecombined Cos-IR PP-Based and ED-Basedmethodsat corpus level (named iTPB-C) and model level(named iTPB-M)

The test set development set [6 15] and in-domaincorpus [2 18] can be used to select data from general-domaincorpus The first strategy is under the assumption that theinput data is known before building the models But in areal case we do not exactly know the test data in advanceTherefore we use the development set (dev strategy) andadditional in-domain corpus (in-domain strategy) which areidentical to the target domain to select data for TM and LMadaptation

6 The Scientific World Journal

Table 2 Corpora statistics

Data Set Lang Sentences Tokens Av len

Test set EN 2050 60399 2946ZH 59628 2909

Dev set EN 2000 59732 2926ZH 2000 59064 2907

In-domain EN 43621 1330464 2916ZH 1321655 2897

Training set EN 1138044 28626367 2515ZH 28239747 2481

5 Results and Discussions

For eachmethodwe used the119873-bests selection results where119873 = 20K 40K 80K 160K 320K 640K sentence pairs outof the 11M from the general corpus (roughly 175 3570 140 280 and 56 of general-domain corpus where119896 is short for thousand and 119898 is short for million) In thissection we firstly discuss three PPBased variants and thencompare the best one with other two models Cos-IR andFMS After that combined model and the best individual arecompared Finally the performance is further improved by asimple but effective system combination method

51 Baselines The baseline results (in Table 3) show that atranslation system trained on the general corpus outperformsthe one trained on the in-domain corpus over 285 BLEUpoints The main reason is that general corpus is so broadthat the OOVs aremuch lesser Combining these two systemscould further increase the GC-baseline by 191 points

52 IndividualModel Thecurves in Figure 2 show that all thethree PP Based variants that is CE CED and B-CED couldbe used to train domain-adapted SMT systems Using lesstraining data they still obtain better performance than GC-baseline CE is the simplest PP Based one and improves theGC-baseline by at most 164 points when selecting 320K outof 11M sentences Although CED gives bias to the sentencesthat are more similar to the target domain its performanceseems unstable with dev strategy and the worst with in-domain strategy This indicates that it is not suitable torandomly select data from general corpus for training LM

119874

(in Section 32) Similar to the conclusion in [2] B-CEDachieves the highest improvement with least selected dataamong the three variants It proves that bilingual resourcesare helpful to balance OOVs and noise However when usingin-domain corpus to select pseudo in-domain subcorporatheir trends are similar but worse They have to enlarge thesize of selected data to perform nearly as well as the resultsunder dev strategy Therefore PP Based approaches may notwork well for a real-life SMT application

Next we use B-CED (the best one among PPBased vari-ants) to compare with other selection criteria As shown inFigure 3 Cos-IR improves by at most 102 (dev) and 088 (in-domain) BLEU points using 28 data of the general corpusThe results approximately match with the conclusions given

Table 3 BLEU via general-domain and in-domain corpus

Baseline Sentences BLEUGC-baseline 11M 3915IC-baseline 45K 3630GI-baseline 4106

Table 4 Translation results of iTPB at different combination levels

Methods Sent BLEU(dev set)

BLEU(in-domain set)

B-CED80K 4091 3550160K 4112 3947320K 4002 4098

iTPB-C80K 4225 3939160K 4304 4187320K 4242 4044

iTPB-M80K 4293 4057160K 4365 4195320K 4397 4221

by [6 15] showing that keywords overlap (in Figure 1) playsa significant role in retrieving sentences in similar domainsHowever it still needs a large amount of selected data (morethan 280) to obtain an ideal performanceThemain reasonmay be that the sentences including same keywords still maybe irrelevant For example two sentences share the samephrase ldquoaccording to the articlerdquo but one sentence is in thelegal domain and the other is from news Figure 3 also showsthat FMS fails to outperform theGC-baseline system even it ismuch stricter than other criteriaWhen addingword positionfactor into similarity measuring FMS tries to find highlysimilar sentences on length collocation and even semanticsBut it seems that our general corpus is not large enoughto cover a certain amount of FMS-similar sentences Withthe size of general or in-domain corpus increases it maybenefit the translation quality because FMS still works betterthan IC-baseline which proves its positive impact on filteringnoise

Among the three presented criteria PP Based can achievethe highest BLEU with considering an appropriate amountof factors for similarity measuring However the curvesshow that it depends heavily upon the threshold 120579 in (2)

The Scientific World Journal 7

0 150 300 450 600 750 900 1050330

345

360

375

390

405

BLEU

CECED

B-CEDGC-base

The numbers of selected sentences (k)

(a)

0 150 300 450 600 750 900 1050180195210225240255270285300315330345360375390405

BLEU

CECED

B-CEDGC-base

The numbers of selected sentences (k)

(b)

Figure 2 BLEU scores via perplexity-based data selection methods with dev (a) and in-domain (b) strategies

0 150 300 450 600 750 900 1050

345

360

375

390

405

420

BLEU

Cos-IRB-CED

FMSGC-base

The numbers of selected sentences (k)

(a)

0 150 300 450 600 750 900 1050

240255270285300315330345360375390405420

BLEU

Cos-IRB-CED

FMSGC-base

The numbers of selected sentences (k)

(b)

Figure 3 BLEU scores via different data selection methods with dev (a) and in-domain (b) strategies

Selecting more or less pseudo in-domain data will lead tothe performance dropping sharply Instead Cos-IR workssteadily and robustly with either and both strategies but itsimprovements are not clearThus any single individualmodelcannot perform well on both effectiveness and robustness

53 Combined Model From Figure 3 we found that eachindividual model peaks between 80K and 320K Thus weonly selected the top 119873 = 80K 160K 320K for furthercomparisonWe combinedCos-IR and FMS aswell as B-CEDand assigned equal weights to each individual model at bothcorpus and model levels (as described in Section 34) Thetranslation qualities via iTPB are shown in Table 4

At both levels iTPBperformsmuch better than any singleindividual model as well as GC-baseline system For instanceiTPB-C has achieved at most 389 (dev) and 272 (in-domain)improvements than the baseline system Also the result isstill higher than the best individual model (B-CED) by 192(dev) and 091 (in-domain) This shows a strong ability tobalance OOV and noise On the one hand filtering toomuch unmatched words may not sufficiently address the datasparsity issue of the SMT model on the other hand addingtoo much of the selected data may lead to the dilution ofthe in-domain characteristics of the SMT model Howevercombinations seem to succeed the pros and reduce the consof the individualmodel In addition the performance of iTPBdoes not drop sharply when changing the threshold in (2)


Table 5: Results of combination models.

Methods        Sent.     BLEU (dev set)    BLEU (in-domain set)
GI-baseline      —            41.06                —
IC-baseline     45K           36.30                —
B-CED+I         80K           41.05              35.80
                160K          41.14              39.54
                320K          41.67              40.67
iTPB-C+I        80K           42.85              41.45
                160K          43.48              41.88
                320K          43.63              41.98
iTPB-M+I        80K           43.33              41.83
                160K          44.13              42.43
                320K          44.37              42.53

This shows that the combined models are both stable and robust enough to train a state-of-the-art SMT system. We also show that model-level combination works better (by up to +1 BLEU) than corpus-level combination under the same settings.
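To illustrate what a model-level combination can look like, the following sketch linearly interpolates the phrase translation probabilities of two phrase tables with equal weights. This is a common realization of model interpolation and stands in for, rather than reproduces, the interpolation described in Section 3.4; the toy phrase pairs are invented for the example.

```python
# Hedged sketch: equal-weight linear interpolation of two phrase tables,
# each mapping (source phrase, target phrase) to a translation probability.
# Entries missing from one table simply contribute zero from that table.

def interpolate_phrase_tables(tables, weights):
    merged = {}
    for table, w in zip(tables, weights):
        for pair, prob in table.items():
            merged[pair] = merged.get(pair, 0.0) + w * prob
    return merged

# toy phrase tables from two individually adapted systems
pt_cos_ir = {("法院", "the court"): 0.6, ("条例", "ordinance"): 0.7}
pt_b_ced  = {("法院", "the court"): 0.8, ("条例", "regulation"): 0.5}

merged = interpolate_phrase_tables([pt_cos_ir, pt_b_ced], [0.5, 0.5])
for pair, prob in sorted(merged.items(), key=lambda x: -x[1]):
    print(pair, round(prob, 2))
```

In practice an interpolated table would also carry the remaining phrase-table features (inverse probabilities and lexical weights), which are omitted here for brevity.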

5.4. System Combination. In order to make good use of all available data, we also combine the in-domain TM with the pseudo in-domain TM to further improve the translation quality. We used the multiple decoding paths technique as described in Section 2.2.

Table 5 shows that a translation system trained on a pseudo in-domain subset selected with an individual or combined model can be further improved by combining it with an in-domain TM. The iTPB-M+I system, which comprises two phrase tables trained on the iTPB-M-based pseudo in-domain subset and on the in-domain corpus, still works best. This combined system is at most 3.31 (dev) and 1.47 (in-domain) BLEU points better than the GI-baseline system, and 8.07 (dev) and 6.23 (in-domain) points better than the in-domain system alone.

6. Conclusions

In this paper, we analyze the impact of different data selection criteria on SMT domain adaptation. This is the first systematic comparison of state-of-the-art data selection methods such as cosine tf-idf and perplexity, and among the revisited methods we introduce edit distance as a new similarity metric for this task. The in-depth analysis should be valuable to other researchers working on data selection.

Based on this investigation, a combination approach that makes use of the individual models is proposed and extensively evaluated. The combination is conducted not only at the corpus level but also at the model level, where domain-adapted systems trained via different methods are interpolated to facilitate the translation. Five individual methods (Cos-IR, CE, CED, B-CED, and FMS), three baseline systems (GC-baseline, IC-baseline, and GI-baseline), as well as the proposed method are evaluated on a large general corpus. Finally, we

further combined the best domain-adapted system with the TM trained on the in-domain corpus to maximize the translation quality. Empirical results reveal that the proposed model achieves good performance in terms of both robustness and effectiveness.

We analyze the results from three different aspects:

(i) Translation quality: the results show significant improvements for most methods, especially for iTPB. Under the current size of the datasets, considering more factors in similarity measuring may not benefit the translation quality.

(ii) Noise and OOVs: balancing them is a big challenge for any single individual data selection model. However, bilingual resources and combination methods are helpful in dealing with this problem.

(iii) Robustness and effectiveness: a real-life system should achieve robust and effective performance with the in-domain strategy. Only iTPB obtained a consistently boosted performance.

Finally, we can draw a composite conclusion (where a > b means that a is better than b):

iTPB > PP-Based > Cos-IR > Baseline > FMS. (14)

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors are grateful to the Science and Technology Development Fund of Macau and the Research Committee of the University of Macau for the funding support for their research under Reference nos. MYRG076(Y1-L2)-FST13-WF and MYRG070(Y1-L2)-FST12-CS.


References

[1] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer, "The mathematics of statistical machine translation: parameter estimation," Computational Linguistics, vol. 19, no. 2, pp. 263–311, 1993.
[2] A. Axelrod, X. He, and J. Gao, "Domain adaptation via pseudo in-domain data selection," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11), pp. 355–362, July 2011.
[3] P. Koehn, F. J. Och, and D. Marcu, "Statistical phrase-based translation," in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL '03), vol. 1, pp. 48–54, 2003.
[4] P. Koehn, A. Axelrod, A. B. Mayne, C. Callison-Burch, M. Osborne, and D. Talbot, "Edinburgh system description for the 2005 IWSLT speech translation evaluation," in Proceedings of the International Workshop on Spoken Language Translation (IWSLT '05), pp. 68–75, 2005.
[5] P. Koehn, "Pharaoh: a beam search decoder for phrase-based statistical machine translation models," in Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA '04), pp. 115–124, Springer, Berlin, Germany, 2004.
[6] Y. Lu, J. Huang, and Q. Liu, "Improving statistical machine translation performance by training data selection and optimization," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '07), pp. 343–350, June 2007.
[7] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), pp. 311–318, Philadelphia, Pa, USA, 2002.
[8] H. Daume III and J. Jagarlamudi, "Domain adaptation for machine translation by mining unseen words," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT '11), pp. 407–412, June 2011.
[9] S. Mansour and H. Ney, "A simple and effective weighted phrase extraction for machine translation adaptation," in Proceedings of the 9th International Workshop on Spoken Language Translation (IWSLT '12), pp. 193–200, 2012.
[10] P. Koehn and B. Haddow, "Towards effective use of training data in statistical machine translation," in Proceedings of the 7th ACL Workshop on Statistical Machine Translation, pp. 317–321, 2012.
[11] J. Civera and A. Juan, "Domain adaptation in statistical machine translation with mixture modeling," in Proceedings of the 2nd ACL Workshop on Statistical Machine Translation, pp. 177–180, 2007.
[12] G. Foster and R. Kuhn, "Mixture-model adaptation for SMT," in Proceedings of the 2nd ACL Workshop on Statistical Machine Translation, pp. 128–136, 2007.
[13] V. Eidelman, J. Boyd-Graber, and P. Resnik, "Topic models for dynamic translation model adaptation," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers (ACL '12), vol. 2, pp. 115–119, 2012.
[14] S. Matsoukas, A. V. I. Rosti, and B. Zhang, "Discriminative corpus weight estimation for machine translation," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '09), vol. 2, pp. 708–717, August 2009.
[15] A. S. Hildebrand, M. Eck, S. Vogel, and A. Waibel, "Adaptation of the translation model for statistical machine translation based on information retrieval," in Proceedings of the 10th Annual Conference of the European Association for Machine Translation (EAMT '05), pp. 133–142, May 2005.
[16] S. C. Lin, C. L. Tsai, L. F. Chien, K. J. Chen, and L. S. Lee, "Chinese language model adaptation based on document classification and multiple domain-specific language models," in Proceedings of the 5th European Conference on Speech Communication and Technology, 1997.
[17] J. Gao, J. Goodman, M. Li, and K. F. Lee, "Toward a unified approach to statistical language modeling for Chinese," ACM Transactions on Asian Language Information Processing (TALIP), vol. 1, pp. 3–33, 2002.
[18] R. C. Moore and W. Lewis, "Intelligent selection of language model training data," in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL '10), pp. 220–224, July 2010.
[19] V. I. Levenshtein, "Binary codes capable of correcting deletions, insertions and reversals," Soviet Physics Doklady, vol. 10, p. 707, 1966.
[20] P. Koehn and J. Senellart, "Convergence of translation memory and statistical machine translation," in Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pp. 21–31, 2010.
[21] J. Leveling, D. Ganguly, S. Dandapat, and G. J. F. Jones, "Approximate sentence retrieval for scalable and efficient example-based machine translation," in Proceedings of the 24th International Conference on Computational Linguistics (COLING '12), pp. 1571–1586, 2012.
[22] T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning, vol. 1, Springer, New York, NY, USA, 2001.
[23] H. Wu and H. Wang, "Pivot language approach for phrase-based statistical machine translation," Machine Translation, vol. 21, no. 3, pp. 165–181, 2007.
[24] P. Koehn and J. Schroeder, "Experiments in domain adaptation for statistical machine translation," in Proceedings of the 2nd ACL Workshop on Statistical Machine Translation, pp. 224–227, 2007.
[25] A. Stolcke, "SRILM - an extensible language modeling toolkit," in Proceedings of the International Conference on Spoken Language Processing, pp. 901–904, 2002.
[26] S. F. Chen and J. Goodman, "An empirical study of smoothing techniques for language modeling," in Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL '96), pp. 310–318, 1996.
[27] P. Koehn, "Europarl: a parallel corpus for statistical machine translation," in Proceedings of the 10th Machine Translation Summit, vol. 5, pp. 79–86, Phuket, Thailand, 2005.
[28] H. P. Zhang, H. K. Yu, D. Y. Xiong, and Q. Liu, "HHMM-based Chinese lexical analyzer ICTCLAS," in Proceedings of the 2nd SIGHAN Workshop on Chinese Language Processing, vol. 17, pp. 184–187, 2003.
[29] P. Koehn, H. Hoang, A. Birch, et al., "Moses: open source toolkit for statistical machine translation," in Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions (ACL '07), pp. 177–180, 2007.
[30] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, no. 1, pp. 19–51, 2003.
[31] F. J. Och, "Minimum error rate training in statistical machine translation," in Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL '03), vol. 1, pp. 160–167, 2003.
[32] M. Federico, N. Bertoldi, and M. Cettolo, "IRSTLM: an open source toolkit for handling large scale language models," in Proceedings of the 9th Annual Conference of the International Speech Communication Association (INTERSPEECH '08), pp. 1618–1621, September 2008.



The Scientific World Journal 9

References

[1] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer, "The mathematics of statistical machine translation: parameter estimation," Computational Linguistics, vol. 19, no. 2, pp. 263–311, 1993.

[2] A. Axelrod, X. He, and J. Gao, "Domain adaptation via pseudo in-domain data selection," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11), pp. 355–362, July 2011.

[3] P. Koehn, F. J. Och, and D. Marcu, "Statistical phrase-based translation," in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL '03), vol. 1, pp. 48–54, 2003.

[4] P. Koehn, A. Axelrod, A. B. Mayne, C. Callison-Burch, M. Osborne, and D. Talbot, "Edinburgh system description for the 2005 IWSLT speech translation evaluation," in Proceedings of the International Workshop on Spoken Language Translation (IWSLT '05), pp. 68–75, 2005.

[5] P. Koehn, "Pharaoh: a beam search decoder for phrase-based statistical machine translation models," in Proceedings of the Association for Machine Translation in the Americas (AMTA '04), pp. 115–124, Springer, Berlin, Germany, 2004.

[6] Y. Lu, J. Huang, and Q. Liu, "Improving statistical machine translation performance by training data selection and optimization," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '07), pp. 343–350, June 2007.

[7] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), pp. 311–318, Philadelphia, Pa, USA, 2002.

[8] H. Daumé III and J. Jagarlamudi, "Domain adaptation for machine translation by mining unseen words," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT '11), pp. 407–412, June 2011.

[9] S. Mansour and H. Ney, "A simple and effective weighted phrase extraction for machine translation adaptation," in Proceedings of the 9th International Workshop on Spoken Language Translation (IWSLT '12), pp. 193–200, 2012.

[10] P. Koehn and B. Haddow, "Towards effective use of training data in statistical machine translation," in Proceedings of the 7th ACL Workshop on Statistical Machine Translation, pp. 317–321, 2012.

[11] J. Civera and A. Juan, "Domain adaptation in statistical machine translation with mixture modeling," in Proceedings of the 2nd ACL Workshop on Statistical Machine Translation, pp. 177–180, 2007.

[12] G. Foster and R. Kuhn, "Mixture-model adaptation for SMT," in Proceedings of the 2nd ACL Workshop on Statistical Machine Translation, pp. 128–136, 2007.

[13] V. Eidelman, J. Boyd-Graber, and P. Resnik, "Topic models for dynamic translation model adaptation," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers (ACL '12), vol. 2, pp. 115–119, 2012.

[14] S. Matsoukas, A. V. I. Rosti, and B. Zhang, "Discriminative corpus weight estimation for machine translation," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '09), vol. 2, pp. 708–717, August 2009.

[15] A. S. Hildebrand, M. Eck, S. Vogel, and A. Waibel, "Adaptation of the translation model for statistical machine translation based on information retrieval," in Proceedings of the 10th Annual Conference of the European Association for Machine Translation (EAMT '05), pp. 133–142, May 2005.

[16] S. C. Lin, C. L. Tsai, L. F. Chien, K. J. Chen, and L. S. Lee, "Chinese language model adaptation based on document classification and multiple domain-specific language models," in Proceedings of the 5th European Conference on Speech Communication and Technology, 1997.

[17] J. Gao, J. Goodman, M. Li, and K. F. Lee, "Toward a unified approach to statistical language modeling for Chinese," ACM Transactions on Asian Language Information Processing, vol. 1, pp. 3–33, 2002.

[18] R. C. Moore and W. Lewis, "Intelligent selection of language model training data," in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL '10), pp. 220–224, July 2010.

[19] V. I. Levenshtein, "Binary codes capable of correcting deletions, insertions and reversals," Soviet Physics Doklady, vol. 10, p. 707, 1966.

[20] P. Koehn and J. Senellart, "Convergence of translation memory and statistical machine translation," in Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pp. 21–31, 2010.

[21] J. Leveling, D. Ganguly, S. Dandapat, and G. J. F. Jones, "Approximate sentence retrieval for scalable and efficient example-based machine translation," in Proceedings of the 24th International Conference on Computational Linguistics (COLING '12), pp. 1571–1586, 2012.

[22] T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning, vol. 1, Springer, New York, NY, USA, 2001.

[23] H. Wu and H. Wang, "Pivot language approach for phrase-based statistical machine translation," Machine Translation, vol. 21, no. 3, pp. 165–181, 2007.

[24] P. Koehn and J. Schroeder, "Experiments in domain adaptation for statistical machine translation," in Proceedings of the 2nd ACL Workshop on Statistical Machine Translation, pp. 224–227, 2007.

[25] A. Stolcke, "SRILM - an extensible language modeling toolkit," in Proceedings of the International Conference on Spoken Language Processing, pp. 901–904, 2002.

[26] S. F. Chen and J. Goodman, "An empirical study of smoothing techniques for language modeling," in Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL '96), pp. 310–318, 1996.

[27] P. Koehn, "Europarl: a parallel corpus for statistical machine translation," in Proceedings of the 10th Machine Translation Summit (AAMT '05), vol. 5, pp. 79–86, Phuket, Thailand, 2005.

[28] H. P. Zhang, H. K. Yu, D. Y. Xiong, and Q. Liu, "HHMM-based Chinese lexical analyzer ICTCLAS," in Proceedings of the 2nd SIGHAN Workshop on Chinese Language Processing, vol. 17, pp. 184–187, 2003.

[29] P. Koehn, H. Hoang, A. Birch et al., "Moses: open source toolkit for statistical machine translation," in Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions (ACL '07), pp. 177–180, 2007.

[30] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, no. 1, pp. 19–51, 2003.


[31] F. J. Och, "Minimum error rate training in statistical machine translation," in Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL '03), vol. 1, pp. 160–167, 2003.

[32] M. Federico, N. Bertoldi, and M. Cettolo, "IRSTLM: an open source toolkit for handling large scale language models," in Proceedings of the 9th Annual Conference of the International Speech Communication Association (INTERSPEECH '08), pp. 1618–1621, September 2008.
