2005061706155551

11
Equivalence and Non-equivalence in Parallel Corpora * T AM ´ AS V ´ ARADI AND G ´ ABOR KISS The present paper shows how an aligned parallel corpus can be used to in- vestigate the consistency of translation equivalence across the two languages in a parallel corpus. The particular issues addressed are the bidirectionality of translation equivalence, the coverage of multiword units, and the amount of implicit knowledge presupposed on the part of the user in interpreting the data. Three lexical items belonging to different word classes were chosen for analysis: the noun head, the verb give, and the preposition with. George Orwell’s novel 1984 was used as source material as it available in English-Hungarian sentence- aligned form. It is argued that the analysis of translation equivalents displayed in sets of concordances with aligned sentences in the target language holds im- portant implications for bilingual lexicography and automatic word alignment methodology. KEYWORDS: parallel corpora, multilingual alignment, sense discrimination 1. Introduction One of the stumbling blocks to machine translation is that words rarely stand in stable one-to-one correspondence with each other across languages. In- stead, they typically have a ramified set of senses that would be rendered with a set of different lexemes in the other language. Bilingual dictionaries can give only limited help in finding the appropriate lexical correspondences because they provide very little information as to which alternative would be suitable in the given context. In fact, as has been recently pointed out by INTERNATIONAL JOURNAL OF CORPUS LINGUISTICS Vol. 6(special issue), 2001. 167–177 John Benjamins Publishing Co.

description

f

Transcript of 2005061706155551

  • Equivalence and Non-equivalence inParallel Corpora *

    TAMAS VARADI AND GABOR KISS

    The present paper shows how an aligned parallel corpus can be used to in-vestigate the consistency of translation equivalence across the two languagesin a parallel corpus. The particular issues addressed are the bidirectionalityof translation equivalence, the coverage of multiword units, and the amount ofimplicit knowledge presupposed on the part of the user in interpreting the data.Three lexical items belonging to different word classes were chosen for analysis:the noun head, the verb give, and the preposition with. George Orwells novel1984 was used as source material as it available in English-Hungarian sentence-aligned form. It is argued that the analysis of translation equivalents displayedin sets of concordances with aligned sentences in the target language holds im-portant implications for bilingual lexicography and automatic word alignmentmethodology.

    KEYWORDS: parallel corpora, multilingual alignment, sense discrimination

    1. Introduction

    One of the stumbling blocks to machine translation is that words rarely standin stable one-to-one correspondence with each other across languages. In-stead, they typically have a ramified set of senses that would be renderedwith a set of different lexemes in the other language. Bilingual dictionariescan give only limited help in finding the appropriate lexical correspondencesbecause they provide very little information as to which alternative wouldbe suitable in the given context. In fact, as has been recently pointed out by

    INTERNATIONAL JOURNAL OF CORPUS LINGUISTICS Vol. 6(special issue), 2001. 167177 John Benjamins Publishing Co.

  • 168 TAMAS VARADI AND GABOR KISS

    Wolfgang Teubert (1999), bilingual equivalence between dictionary entriesis very often not bi-directional. As a useful complement to bilingual dictio-naries and conceptual ontologies, Teubert recommends the corpus linguisticapproach which he illustrates through data from monolingual corpora. In thispaper, we explore this line of research by adducing evidence from parallelcorpora.

    2. The problem with bilingual dictionaries

    Even if available in electronic format, ordinary dictionaries created for humanusers present certain problems for natural language processing (Boguraev andBriscoe 1989). Some limitations derive from the fact that lexicographers in-evitably rely on the co-operation of the readers to exploit the informationcompiled in the body of dictionary entries. One source of relatively lowlevel difficulties is that all dictionaries make shortcuts in presenting the datain an effort to compress the maximal amount of information in the availablespace. Users are expected to decode and apply the formatting conventionsthat are usually set out in an introductory section. This is a task that is notalways trivial to automate as was reported by the CONCEDE project forseveral dictionaries (Erjavec et al. 1999). More serious than this proceduraldifficulty are the general deficiencies in content. As was demonstrated byTeubert (op.cit.), bilingual dictionaries tend to give a list of equivalents withvery little help (apart from usage notes) as to which one is to be used in theparticular context the dictionary user needs. There is, furthermore, an unduepreponderance of single-word equivalents at the expense of multi-word units.When one tries to look up the equivalents in the other side of a bilingual dic-tionary, one finds surprisingly few bidirectional equivalents. Teubert presentsthe situation graphically in Figure 1 through the analysis of the semanticfield associated with the German word Trauer. The figure was arrived at byfirst looking up the equivalents of Trauer in Langenscheidt EnzyklopadischesWorterbuch and successively looking up the equivalents found in one lan-guage in the other side of the dictionary until the senses started to becomeremote from the semantic field of the original word. (Bidirectional equiva-lents are printed in grey, unidirectional in black.)

  • EQUIVALENCE AND NON-EQUIVALENCE IN PARALLEL CORPORA 169

    Trauer

    Kummer

    Gram

    Leid

    Betrbnis

    Schmerz

    Pein

    [Niedergeschlagenheit]

    [Klage]

    Jammer

    Sorge

    [Trbsal]

    grief

    sorrow

    mourning

    affliction

    distress

    pain

    smart

    [suffering]

    [ruefulness]

    Figure 1. German-English translation equivalents related to Trauer (based on Teubert1999)

    3. The rationale for the present work

    In the present paper, we intend to investigate to what extent use of paral-lel corpora can help to eliminate some of the difficulties noted with bilin-gual dictionaries. It was assumed that parallel corpora are amenable to thesame procedure of traversal of translation equivalents. At the same time, thedata are sufficiently different in key aspects to warrant the replication of themethodology.

    In particular, we set out to investigate what picture emerges from parallelcorpora with regard to a) consistency of coverage (bidirectional vs. unidirec-tional equivalences), b) coverage of multiword units, collocations, c) theamount of user knowledge presupposed, and d) what implications are therefor bilingual lexicography and NLP, for example, automatic word alignment.

    4. Methodological issues

    4.1 Data and encoding scheme

    As source data, we used the sentence aligned Hungarian-English parallel cor-pus of Orwells 1984 developed in the MULTEXT-EAST project (Erjavec and

  • 170 TAMAS VARADI AND GABOR KISS

    translation unit(domain of alignment)

  • EQUIVALENCE AND NON-EQUIVALENCE IN PARALLEL CORPORA 171

    Figure 3. A sample query output

    begins in lowercase or uppercase) where the corresponding alignment unit inthe Hungarian corpus OHU includes the lemma fej.

    It should be noted that because the two parts of the parallel corpus werealigned at the sentence level, it was not possible to establish automaticallywhether the two words in the search expressions were actually translationequivalents. All that can be stated with certainty is that the aligned sentencescontained the two words in question. The corresponding parts of the alignedsentence pair had to be related manually.

    4.2 The analysis

    We have selected three English lexemes for analysis that is, head, give, andwith. By focussing on three words of so radically different parts of speech,we intended to examine whether our findings were sensitive to word classmembership. In schematic form, we carried out our analysis through thefollowing steps.

    (1) Find prototypical equivalents for the English words. For each of thethree words, it was easy to find single uncontested candidates for thisstatus: head = fej, give = ad, with = -val/-vel. (-val/-vel are variant

  • 172 TAMAS VARADI AND GABOR KISS

    forms of the instrumental suffix governed by vowel harmony.) Belowwe will refer to members of prototypical equivalent pairs simply as L1word and L2 word

    (2) Generate three concordance sets where(1) L1 word is translated with L2 word(2) L1 word is translated with non-L2 word(3) L2 word is translated with non-L1 wordTo automate this step, a perl script was developed that produced thethree sets from two words specified on the command line.

    (3) Repeat step 2) with other L1 and L2 words from 2b) and 2c) until thesemantic field of the original English lexeme seems to be saturated.

    4.3 Limitations of the approach

    Our analysis inevitably faced certain limitations owing to the scope andnature of the source data used. One immediately obvious constraint wasthe size of the data, which was about a hundred thousand words. Whilemonolingual corpora do exist for Hungarian, the Hungarian National corpuscurrently number more than 80 million words (Varadi 1999), parallel corporaare much harder to come by even for other language pairs let alone forHungarian-English. Another such corpus is Platos Republic developed as aTELRI joint research effort, but the Hungarian English alignment was notavailable at the time the work reported here was undertaken.

    Apart from size, another practical limitation was the rather limited lan-guage variety used for source data. Again, this problem could be remediedwith the extension of the data not just in terms of size but also in register.

    A third peculiarity of the data that one must bear in mind is its inherentlyunidirectional nature. Although it is tempting to look at the aligned sentencesfrom either direction, it still remains true that for any pair of sentences,one is the source and the other is the target of the translation. Hence, wesimply cannot speak of the translation equivalents of any Hungarian wordin our data. This deficiency could be compensated by involving data thatare translations of Hungarian source texts though this could raise all sorts of

  • EQUIVALENCE AND NON-EQUIVALENCE IN PARALLEL CORPORA 173

    Table 1. The actual distribution of the assumed prototypical translation equivalents

    Head other total give other total with other totalfej 45 11 56 ad 26 43 68 -val/-vel 337 412 759other 21 other 79 other 285total 66 total 105 total 622

    issues about how close a match there is between the two source texts. In thissense, there is hardly any genuine bidirectional parallel corpus.

    5. Findings

    5.1 Prototypical vs. other equivalents

    Table 1 presents the statistical summary of our findings. The figures affordseveral interesting conclusions. It appears that the fit between the actualtranslation equivalence and the presumed prototypical equivalents, as mea-sured in the ratio of the prototypical cases within the total, varies with thelanguage as well as the word class if not the individual lexeme. For exam-ple, while close to 70% of the instances of head were rendered with theexpected prototypical equivalent fej in Hungarian, the same ratio for with is54% and give is translated with ad in less than 25% of the cases. If we lookat all occurrences of the Hungarian equivalents, we find the same rankingof the items in terms of the ratio of the prototypical equivalent to all othertranslation equivalents (fej, -val/-vel, ad) but at a higher level (80%, 44%,38%). Recall that given the unidirectional nature of our text data, one shouldinterpret the Hungarian figures for fej, for example, as the number of timesfej was used as the translation of head vs. other English words.

    5.2 A close up profile

    Figure 4 shows the distribution of all the translation equivalents of the wordhead found in our data. The figure on the left shows a complete listing ofthe equivalents summarized in Table 1. It traces the corresponding itemsin both directions at one level of depth. The graphic display of the othervariants make it immediately clear that the spread of the Hungarian translation

  • 174 TAMAS VARADI AND GABOR KISS

    head

    mind

    skull

    face

    pop up

    0

    fej head

    blint nod

    tark back of the head

    agy mind, brain

    eszbe jut come to mind

    koponya skull

    haj hair

    szj mouth

    arckp portrait

    felfigyel listen up

    sajt maga by ones own effort

    osztlyvezet department head

    ltra vge head of the ladder

    45

    4

    3

    3

    3

    3

    4

    2

    head

    mind

    skull

    fej head

    blint nod

    tark back of the head

    agy mind, brain

    eszbe jut come to mind

    llek mind, soul

    tudat consciousness

    gondolat thought

    elme mind

    koponya skull

    csontvz skeleton

    45

    4

    3

    3

    3

    3

    2

    33

    11

    11

    6

    6

    5

    6

    7

    Figure 4. Tracing the translation equivalents of head

    equivalents of head is much broader than the range of words rendered as fej.Except for the last two cases, all uses of the word head made reference tothe body part in a non-metaphorical sense. It is interesting to note that someuses were rendered with a verb or verb phrase (eszebe jutcome to oneshead, felfigyel listen upraised his head). Also note the number of cases(4) where there was no English source for fej at all.

    The diagram on the right in Figure 4 traverses the links between trans-lation equivalents one step further, eliminating, for clarity, all the cases witha single occurrence.

    Figure 5 displays some instances where the Hungarian equivalents ofhead (i.e., szaj mouth, sajat maga by ones own effort) may seem oddwhen viewed purely at the lexical level. On the strength of the Englishexamples alone, it is easy to see that there is nothing contrived about theuses involved here; yet, any lexicographer would probably feel reluctant tointroduce such equivalence as headszaj mouth in a dictionary.

    The richness of variety of data brought to the fore with this methodis further illustrated by Figure 6 displaying the different expressions ren-dered as eszebe jut come to mind. It is important to note that none of thecorresponding items is a single-word unit.

    When we turn to the verb give, we find that it is practically impossibleto even describe the bare items standing in correspondence without makingrecourse to multi-word expressions. The data in Figure 6 clearly argues for

  • EQUIVALENCE AND NON-EQUIVALENCE IN PARALLEL CORPORA 175

    head - haj hair

    : He plucked at Winstons and brought away a tuft of hair.

    -->ohu: Belemarkolt Winston hajba, s kihzott egy csomt.

    head -- szj mouth

    : And the few you have left are dropping out of your .

    -->ohu: S az a nhny is, ami mg megvan, kiesik a szdbl.

    head -- sajt maga

    : A great deal of the time you were expected to make the up

    out of your

    -->ohu: Az ember gyakran sajt maga volt knytelen kitallni.

    head -- felfigyel

    : Winston raised his to listen

    -->ohu: Winston felfigyelt.

    Figure 5. A sample of the translation equivalents in context

    the importance of the context in defining the units that enter into a bilingualequivalence relationship. Note that without considering the context in whichthe translation equivalents occur, one may get paradoxical correspondenceslike give = kap get. Such cases can only be interpreted if the differentorganisation of the sentences in which they occur are also considered. In othercases, it is enough to make reference to the typical objects with which theitem co-occurs. One feature in Hungarian that provides for a proliferation ofequivalents in English is the presence of the coverb particle that creates newmeanings of the verb stem that are often rendered in English with a separatelexeme. Examples include kiadpublish/issue, atadhand over, eladsell,etc.

    6. Conclusions

    We do not have the space here to discuss the data uncovered by the analysisin the detail that it clearly merits. However, we are positive that the evidencepresented above is sufficient to draw the following conclusions.

    The corpus linguistic approach advocated by Teubert has received amplecorroboration from the bilingual corpus evidence presented. On the issue

  • 176 TAMAS VARADI AND GABOR KISS

    give

    was given as

    give sb. the impression

    give a glance

    give his name

    give way to

    give off (smell)

    give one away

    give up trying

    at any given moment

    by a given date

    with no reason given

    dont give a damn

    kap get, receive

    jelltk meg was marked

    az volt az rzse had the feeling

    krlnzett looked around

    megmond say

    kvetkezett followed

    raszt ooze

    elrulhat could betray

    felhagynak abandon/cease to do

    mindig always

    zros hatridn bell

    minden indokls nlkl

    senkinek sem akarunk rtani

    Figure 6. Translation equivalents of give

    of the bi-directionality of equivalents, we also did not find that the set ofbilingual equivalents formed a closed set. However, this may well be due tothe sparseness of data used in this pilot experiment.

    Our findings have important implications for bilingual lexicography. Themost important point to note is the vital need to integrate corpus evidence.Enriching the dictionary with contextual evidence serves to eliminate severalshortcomings noted earlier. As the data is embedded in context, it will almostinevitably bring with it guidance as to how the particular item is to be used.Showing usage through examples will also obviate the need for terse andhighly abstract formulations. True, the intuitions of the dictionary users arerequired here too, but developing intuitions through actual language data isa task that the average human user is better equipped to handle than dealingwith an arid list of bilingual equivalents.

    More extensive direct integration of the context should also narrow thecurrent gap between lexical and textual equivalence. We have presented nu-merous examples for translation equivalents that make perfect sense in theparticular context they are used which, however, may seem puzzling if notdownright false when viewed out of context. Any attempt to base definitionson real translation equivalence will result in more numerous use of multi-word expressions simply because most of the time it is just not feasible andcertainly not practicable to tease out some single word and equate it with

  • EQUIVALENCE AND NON-EQUIVALENCE IN PARALLEL CORPORA 177

    another one in the target language at a relatively abstract level. The datapresented in the paper clearly suggests that one can only do justice to theintricate and rich texture of context if the two languages are related not atthe lexical level of the word but rather at the contextual level embodied inphrases.

    The difficulties of pinning down bilingual equivalence on the individualwords also has implications for automatic word alignment methodology. Nomatter how wide a window one establishes within which to scan for equiva-lents, as long as the search is centred on individual words, the procedure isfaced with serious limitations.

    Notes

    * The research reported in the paper was supported by Orszagos Tudomanyos KutatasiAlapprogramok (grant number T026091)

    References

    Boguraev, B. and E. J. Briscoe. 1989. Computational Lexicography for Natural LanguageProcessing. London and New York: Longman.

    Christ, Oliver. 1994. A Modular and Flexible Architecture for an Integrated CorpusQuery System. Papers in Computational Lexicography COMPLEX94, ed. by Kieferet al., 2332. Budapest: Linguistics Institute, Hungarian Academy of Sciences.

    Erjavec, Tomaz and Nancy Ide. 1998. The MULTEXT-East corpus. First InternationalConference on Language Resources and Evaluation, LREC 98, ed. by Rubio et al.,971974. Granada: ELRA.

    Erjavec, Tomaz. 1999. Making the ELAN Slovene/English corpus. Proceedings of theWorkshop Language TechnologiesMultilingual Aspects, ed. by Spela Vintar., 2330.Ljubljana: Department of Translation and Interpreting.

    Erjavec, Tomaz, Dan Tufis, and Tamas Varadi. 1999. Developing TEIConformantLexical Databases for CEE Languages. Papers in Computational LexicographyCOMPLEX99, ed. by Kiefer et al., 205209. Budapest: Linguistics Institute,Hungarian Academy of Sciences.

    Teubert, Wolfgang. 1999. Starting with Trauer. Approaches to Multilingual LexicalSemantics. Papers in Computational Lexicography COMPLEX99, ed. by Kiefer etal., 153169. Budapest: Linguistics Institute, Hungarian Academy of Sciences.

    Varadi, Tamas. 1999. On Developing the Hungarian National Corpus in Vintar Spela(ed.): Proceedings of the Workshop Language TechnologiesMultilingual Aspects,32nd Annual Meeting of the Societas Linguistica Europea, Ljubljana, 5763.

    Equivalence and Non-equivalence in Parallel Corpora1. Introduction2. The problem with bilingual dictionaries3. The rationale for the present work4. Methodological issues4.1. Data and encoding scheme4.2. The analysis4.3. Limitations of the approach

    5. Findings5.1. Prototypical vs. other equivalents5.2. A close up profile

    6. ConclusionsNotesReferences