Spotting Translationese: An Empirical Approach

Pau Giménez Flores
Supervisors: Carme Colominas and Toni Badia
Universitat Pompeu Fabra

description

This research aims to provide empirical evidence of translationese, which has been defined as the dialect, sub-language or code of translated language. Translationese has been demonstrated empirically through isolated phenomena in particular language pairs, but there has been no systematic study involving more than two languages. Nor have we found any previous study of translationese in Catalan. We intend to validate the translationese hypothesis first in a corpus of original and translated Catalan, and then in other languages such as Spanish, French, English and German by reusing the same methodology. In this way, we will try to demonstrate that translationese is empirically observable and automatically detectable. The goal is to determine which patterns of translation are universal across languages and which depend on the source or target language. The data collected and the resources created for identifying lexical, morphological and syntactic patterns of translations can be of great help to Translation Studies teachers, scholars and students: teachers will have tools to help students avoid reproducing translationese patterns, and the resources will help in detecting non-genuine words and inadequate structures in the target language, which should improve the stylistic quality of translations. Machine Translation companies can also use our resources to improve their translation quality.

Transcript of Spotting Translationese: An Empirical Approach

Content

1. Translationese
2. Goals
3. Translation Universals
4. Empirical Methods in Translation Studies
5. Theoretical Framework
6. Hypotheses
7. Methodology
8. Working Plan
9. Commented Bibliography

Translationese

• A product of the incompetence of the translator (translation errors):

–“unusual distribution of features is clearly a result of the translator’s inexperience or lack of competence in the target language” (Baker, 1998: 248)

• Translation-specific language or dialect, without any negative connotations (translation universals):

–Third code “which arises out of the bilateral consideration of the matrix and target codes: it is, in a sense, a sub-code of each of the codes involved” (Frawley, 1984: 168).

–Translationese: set of linguistic features of translated texts which are different both from the source language and the target language (Gellerstam, 1986).

Goals

• Main goal: validating the hypothesis of translationese empirically.

– Capturing the linguistic properties of translationese in observable and refutable facts.

– Automatically detecting and classifying translated vs. non-translated texts based on their syntactic and lexical properties.

Translation Universals (1)

“Features which typically occur in translated text rather than original utterances and which are not the result of interference from specific linguistic systems” (Baker, 1993: 243)

Translation Universals (2)

• Explicitation or explicitness: translations tend to be more explicit than their source texts.

– Repetition of redundant grammatical items (e.g. prepositions).

– The optional that-connective is more frequent in reported speech in translated English (Olohan and Baker, 2000).

Translation Universals (3)

• Simplification: the language of translations is assumed to be lexically and syntactically simpler than that of non-translated target language texts.

– Narrower range of vocabulary: lower type-token ratio.

– Lower level of information load: lower lexical density.
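The two simplification measures named above can be sketched in a few lines of Python. This is a minimal illustration rather than the study's implementation: the tokenizer, the tiny English function-word list `FUNCTION_WORDS` and the sample sentence are all assumptions made for the example. Lexical density is taken here as the share of tokens that are not function words.

```python
import re

# Hypothetical, deliberately tiny function-word list (an assumption for this
# sketch); a real study would rely on a POS tagger or a full stop list.
FUNCTION_WORDS = {
    "the", "a", "an", "of", "in", "on", "to", "and", "or", "but",
    "is", "are", "was", "were", "it", "that", "this", "by", "with",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def type_token_ratio(tokens):
    """Number of distinct word forms divided by the number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def lexical_density(tokens, function_words=FUNCTION_WORDS):
    """Share of tokens that are content words (not in function_words)."""
    if not tokens:
        return 0.0
    content = [t for t in tokens if t not in function_words]
    return len(content) / len(tokens)


# Invented sample sentence, for illustration only.
sample = "The house is on the hill and the hill is green"
tokens = tokenize(sample)
print("type-token ratio:", type_token_ratio(tokens))
print("lexical density:", lexical_density(tokens))
```

Because the type-token ratio falls as texts get longer, comparisons between translated and original texts are usually made on samples of equal size.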

Translation Universals (4)

• Normalization: exaggeration of typical features of the target language. Translations tend to be more unmarked and conventional, less creative, more conservative.

– Conventionalization of metaphors and idioms.

– Dialectal and colloquial expressions less frequent.

– Lexical choice of ‘standard translation’ (Gellerstam, 1986).

Translation Universals (5)

• Interference from the source text and language (Toury, 1995; Mauranen, 2000). It can occur at the morphological, lexical and syntactic levels, among others.

• Unique items hypothesis (Tirkkonen-Condit, 2002): translated texts “manifest lower frequencies of linguistic elements that lack linguistic counterparts in the source languages such that these could also be used as translation equivalents” (Simplification, Normalization?)

Translation Universals (6)

However:

“The as yet relatively small amount of research into potential translation universals has produced contradictory results, which seems to suggest that a search for real, ‘unrestricted’ universals in the field of translation might turn out to be unsuccessful.”

Puurtinen (2003: 403)

Empirical Methods in TS (1)

• Laviosa-Braithwaite (1996): study of the linguistic nature of English translated text in a subsection of the English Comparable Corpus (ECC).

• Øverås (1998): investigation of explicitation in translational English and translational Norwegian.

• Olohan and Baker (2000): testing of the explicitation hypothesis based on the omission and inclusion of the reporting that in translational and original English.

Empirical Methods in TS (2)

• Borin and Prütz (2001): comparison of original newspaper articles in British and American English with articles translated from Swedish into English, based on POS n-grams.

• Puurtinen (2003): investigation of potential features of translationese in a corpus of Finnish translations of children’s books.
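A POS n-gram comparison of the kind attributed to Borin and Prütz above might look roughly like the following sketch. It assumes the texts have already been tagged; the tagset, the two toy tag sequences and the function names are hypothetical and chosen only to show the counting step.

```python
from collections import Counter


def pos_ngrams(tags, n):
    """Return all contiguous n-grams over a sequence of POS tags."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def relative_frequencies(tags, n):
    """Map each POS n-gram to its share of all n-grams in the text."""
    counts = Counter(pos_ngrams(tags, n))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


# Toy tag sequences (hypothetical tagset and data, for illustration only).
original = ["DET", "NOUN", "VERB", "DET", "NOUN", "PREP", "DET", "NOUN"]
translated = ["PREP", "DET", "NOUN", "DET", "NOUN", "VERB", "ADV"]

orig_freq = relative_frequencies(original, 2)
trans_freq = relative_frequencies(translated, 2)

# Rank bigrams by how much more frequent they are in the translated sample.
all_grams = set(orig_freq) | set(trans_freq)
ranked = sorted(
    all_grams,
    key=lambda g: trans_freq.get(g, 0.0) - orig_freq.get(g, 0.0),
    reverse=True,
)
for gram in ranked[:3]:
    diff = trans_freq.get(gram, 0.0) - orig_freq.get(gram, 0.0)
    print(gram, round(diff, 3))
```

On real corpora the raw differences would need a significance test before any pattern is read as over- or underuse.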

Empirical Methods in TS (3)

• Baroni and Bernardini (2006): application of supervised machine learning techniques (SVMs) to detect translationese on two monolingual corpora of translated and original Italian texts.
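To make the idea of learning the translated/original distinction concrete, here is a minimal sketch of a linear SVM trained by stochastic subgradient descent on the regularised hinge loss. It is a simplified stand-in written in plain Python, not the setup used by Baroni and Bernardini or by any SVM toolkit; the two features, the hand-made data points and all parameter values are assumptions for illustration.

```python
import random


def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=200, seed=0):
    """Fit a linear SVM by stochastic subgradient descent on the
    L2-regularised hinge loss. Labels in y must be +1 or -1."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            score = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            violated = y[i] * score < 1
            # The regulariser shrinks the weights on every step.
            w = [wj * (1 - eta * lam) for wj in w]
            if violated:
                # Hinge-loss subgradient: move towards the violating example.
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b


def predict(w, b, x):
    """Return +1 (translated) or -1 (original) for one feature vector."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical, hand-made feature vectors: [scaled frequency of an optional
# connective, type-token ratio]. These are not measurements from any corpus.
X_train = [
    [1.0, 0.2], [0.9, 0.3], [0.8, 0.1],   # labelled translated (+1)
    [0.2, 0.9], [0.1, 1.0], [0.3, 0.8],   # labelled original (-1)
]
y_train = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X_train, y_train)
print([predict(w, b, x) for x in X_train])
print(predict(w, b, [0.95, 0.15]), predict(w, b, [0.15, 0.95]))
```

In a real experiment the feature vectors would hold many more dimensions (for example function-word and POS n-gram frequencies), and accuracy would be measured on held-out texts rather than on the training data.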

Empirical Methods in TS (4)

• Rayson et al. (2008): a descriptive study of translationese that compares the frequencies of keywords, key word classes (POS) and key semantic tags in original Chinese, translated English and edited translated English corpora.

• Tirkkonen-Condit (2002): Translationese – a myth or an empirical fact? Human translators could not reliably identify whether a text was translated or not.
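Keyword comparisons between two corpora, such as the one described for Rayson et al. above, are often scored with a log-likelihood statistic. The slide does not say which statistic was used, so the choice below is an assumption, and the two token lists are invented toy data. The sketch scores each word by how unevenly it is distributed across the two samples.

```python
import math
from collections import Counter


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Log-likelihood keyness of a word seen freq_a times in a corpus of
    size_a tokens and freq_b times in a corpus of size_b tokens."""
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    if freq_a > 0:
        ll += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        ll += freq_b * math.log(freq_b / expected_b)
    return 2 * ll


def keywords(tokens_a, tokens_b, top=5):
    """Return the `top` (score, word) pairs with the highest keyness."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    scored = [
        (log_likelihood(counts_a[word], counts_b[word], size_a, size_b), word)
        for word in set(counts_a) | set(counts_b)
    ]
    return sorted(scored, reverse=True)[:top]


# Invented toy samples standing in for a translated and an original corpus.
translated_tokens = ["that"] * 6 + ["the"] * 4
original_tokens = ["the"] * 6 + ["said"] * 4

for score, word in keywords(translated_tokens, original_tokens, top=3):
    print(word, round(score, 2))
```

The score only says that a word is unevenly distributed; the direction (overused or underused in translations) has to be read from the raw frequencies.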

Theoretical Framework

Crossroads of Corpus Linguistics, Translation Studies and Computational Linguistics

• It is empirical research in which corpora are the main source of both data and hypotheses (Laviosa-Braithwaite, 1996; Olohan and Baker, 2000, etc.).

• It tries to validate the existence of translationese and to define the linguistic properties of translated language as a product (Gellerstam, 1986; Baker, 1993, etc.).

• Use of Computational Linguistic techniques such as information extraction and machine learning algorithms (Kindermann et al., 2003; Baroni and Bernardini, 2006)

Hypotheses

1. Translationese exists and it is observable across languages.

2. This fact can be demonstrated with empirical methods applied to corpora in different languages.

Methodology (1)

Preliminary study: two monolingual comparable corpora of original and translated Catalan texts on art and architecture, 300,000 tokens each.

• Corpus Building

– Corpus compilation

– Tokenization, tagging and parsing with CatCG (Alsina, Badia et al. 2002)

• Corpus Exploitation

– Exploitation with WordSmith Tools (wordlists, frequency lists, type-token ratio, lexical density, concordance lists)

– Implementation of scripts to extract collocations and POS n-grams with Python and NLTK

• Implementation of a Machine Learning System

– Machine Learning techniques (SVMs) to automatically classify texts as translated or non-translated.

– Training on a subset of the corpus and testing (Weka software).
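As an illustration of the collocation-extraction step in the plan above, the sketch below ranks adjacent word pairs by pointwise mutual information (PMI). The plan names Python and NLTK, whose collocation finders would be the natural tool; this version uses only the standard library so that it runs without extra dependencies. The choice of PMI, the `min_count` threshold and the toy token list are assumptions for the example.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score each adjacent word pair seen at least min_count times by
    pointwise mutual information (log2 of observed over expected)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(n_tokens - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI values
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Invented toy text, for illustration only.
tokens = "art nouveau is art nouveau and gothic art is art".split()
scores = bigram_pmi(tokens)
for pair, pmi in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(pair, round(pmi, 3))
```

Running the same extraction on the translated and the original corpus would give two ranked lists whose differences can then be inspected by hand.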

Methodology (2)

Main experiment

• Corpus Building

– Corpus compilation (Spanish, French, English, German)

– Tokenization, tagging and parsing

• Corpus Exploitation

– Exploitation with WordSmith Tools (wordlists, frequency lists, type-token ratio, lexical density, concordance lists)

– Implementation of scripts to extract collocations and POS n-grams with Python and NLTK

• Implementation of a Machine Learning System

– Machine Learning techniques (SVMs) to automatically classify texts as translated or non-translated.

– Training on a subset of the corpus and testing (Weka software).

Working Plan

Commented Bibliography (1)

• Baker, M. (1995). Corpora in Translation Studies: An Overview and Some Suggestions for Future Research. Target 7, 2: 223-243.

– Definition of a new type of corpora: monolingual comparable corpora in order to “effect a shift away from comparing either ST with TT or language A with language B to comparing text production per se with translation.”

– Type-token ratio, lexical density measures.

• Borin, L. and Prütz, K. (2001). Through a Glass Darkly: Part-of-speech Distribution in Original and Translated Text, in Computational linguistics in the Netherlands 2000, 30-44.

– Comparison of POS n-grams in order to determine whether there are significant syntactic differences between original and translated language.

– Overuse in translated English of preposition-initial sentences and sentence-initial adverbs.

Commented Bibliography (2)

• Kindermann et al. (2003). Authorship attribution with support vector machines. Applied Intelligence 19, 109-123.

– Different statistical techniques for authorship attribution are described: the log-likelihood ratio statistic, naïve Bayesian probabilistic classifiers, multi-layer perceptrons, k-nearest neighbour classification (kNN), Support Vector Machines (SVMs), etc.

– SVMs achieve better results than other classifiers in author attribution: they are fast and allow a great number of features as input.

Commented Bibliography (3)

• Baroni, M. and Bernardini, S. (2006). A New Approach to the Study of Translationese: Machine-Learning the Difference between Original and Translated Text. Literary and Linguistic Computing 21(3): 259-274.

– A new explicit criterion to prove the existence of translationese: learnability by a machine.

– SVMs allow the use of a large number of features.

– The application of SVMs achieves better results than professional human translators.

– Their results show that translations are recognizable on purely grammatical/syntactic grounds (function word distribution and shallow syntactic patterns).

Commented Bibliography (4)

• Tirkkonen-Condit, S. (2002). Translationese – a Myth or an Empirical Fact? Target, 14 (2): 207–20.

– The hypothesis of translationese is, at least, controversial, whereas the unique items hypothesis can better describe the translated or non-translated nature of a text.

– Translated texts “manifest lower frequencies of linguistic elements that lack linguistic counterparts in the source languages such that these could also be used as translation equivalents”.