Spotting Translationese: An Empirical Approach

Pau Giménez Flores

Supervisors: Carme Colominas and Toni Badia

Universitat Pompeu Fabra

Content

1. Translationese
2. Goals
3. Translation Universals
4. Empirical Methods in Translation Studies
5. Theoretical Framework
6. Hypotheses
7. Methodology
8. Working Plan
9. Commented Bibliography

Translationese

• A product of the incompetence of the translator (translation errors):

– “unusual distribution of features is clearly a result of the translator’s inexperience or lack of competence in the target language” (Baker, 1998: 248)

• Translation-specific language or dialect, without any negative connotations (translation universals):

–Third code “which arises out of the bilateral consideration of the matrix and target codes: it is, in a sense, a sub-code of each of the codes involved” (Frawley, 1984: 168).

–Translationese: set of linguistic features of translated texts which are different both from the source language and the target language (Gellerstam, 1986).

Goals

• Main goal: validating the hypothesis of translationese empirically.

– Capturing the linguistic properties of translationese in observable and refutable facts.

– Automatically detecting and classifying translated vs. non-translated texts based on their syntactic and lexical properties.

Translation Universals (1)

“Features which typically occur in translated text rather than original utterances and which are not the result of interference from specific linguistic systems” (Baker, 1993: 243)


• Explicitation or explicitness: translations tend to be more explicit than source texts.

– Repetition of redundant grammatical items (e.g. prepositions).

– The optional that-connective is more frequent in reported speech in translated English (Olohan and Baker, 2000).


• Simplification: the language of translations is assumed to be lexically and syntactically simpler than that of non-translated target language texts.

– Narrower range of vocabulary: lower type-token ratio.

– Lower level of information load: lower lexical density.
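The two simplification measures above can be computed directly from a list of tokens. The following is a minimal Python sketch, not the procedure used in the study: the tokenizer is a naive regular expression, and `FUNCTION_WORDS` is a small hypothetical English stop list standing in for a proper tagger-based distinction between lexical and grammatical words.

```python
import re

# Hypothetical, deliberately tiny list of English function words.
# A real study would rely on a POS tagger or a full stop list.
FUNCTION_WORDS = {
    "the", "a", "an", "of", "in", "on", "to", "and", "or",
    "that", "is", "was", "it", "by", "with",
}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def type_token_ratio(tokens):
    """Number of distinct word forms divided by the number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def lexical_density(tokens, function_words=FUNCTION_WORDS):
    """Share of tokens that are not function words."""
    if not tokens:
        return 0.0
    lexical = [t for t in tokens if t not in function_words]
    return len(lexical) / len(tokens)


tokens = tokenize("The cat sat on the mat")
print(type_token_ratio(tokens), lexical_density(tokens))
```

Because the type-token ratio falls as a text gets longer, the two values are only comparable across corpora of similar size.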

Translation Universals (4)

• Normalization: exaggeration of typical features of the target language. Translations tend to be more unmarked and conventional, less creative, more conservative.

– Conventionalization of metaphors and idioms.

– Dialectal and colloquial expressions less frequent.

– Lexical choice of ‘standard translation’ (Gellerstam, 1986).


• Interference from the source text and language (Toury, 1995; Mauranen, 2000). It can occur at the morphological, lexical and syntactic levels, among others.

• Unique items hypothesis (Tirkkonen-Condit, 2002): translated texts “manifest lower frequencies of linguistic elements that lack linguistic counterparts in the source languages such that these could also be used as translation equivalents” (Simplification, Normalization?)


However:

“The as yet relatively small amount of research into potential translation universals has produced contradictory results, which seems to suggest that a search for real, ‘unrestricted’ universals in the field of translation might turn out to be unsuccessful.” (Puurtinen, 2003: 403)

Empirical Methods in TS (1)

• Laviosa-Braithwaite (1996): study of the linguistic nature of English translated text in a subsection of the English Comparable Corpus (ECC).

• Øverås (1998): investigation of explicitation in translational English and translational Norwegian.

• Olohan and Baker (2000): testing of the explicitation hypothesis based on the omission and inclusion of the reporting that in translational and original English.


• Borin and Prütz (2001): comparison of original newspaper articles in British and American English with articles translated from Swedish into English, using POS n-grams.

• Puurtinen (2003): research of potential features of translationese in a corpus of Finnish translations of children’s books.


• Baroni and Bernardini (2006): application of supervised machine learning techniques (SVMs) to detect translationese on two monolingual corpora of translated and original Italian texts.


• Rayson et al. (2008): a descriptive study of translationese comparing the frequencies of keywords, key word classes (POS) and key semantic tags in original Chinese, translated English and edited translated English corpora.

• Tirkkonen-Condit (2002): Translationese – a myth or an empirical fact? Human translators were not good at identifying whether a text was translated or not.

Theoretical Framework

Crossroads of Corpus Linguistics, Translation Studies and Computational Linguistics

• It is empirical research in which corpora are the main source of both data and hypotheses (Laviosa-Braithwaite, 1996; Olohan and Baker, 2000, etc.).

• It tries to validate the existence of translationese and to define the linguistic properties of translated language as a product (Gellerstam, 1986; Baker, 1993, etc.).

• Use of Computational Linguistic techniques such as information extraction and machine learning algorithms (Kindermann et al., 2003; Baroni and Bernardini, 2006)

Hypotheses

1. Translationese exists and it is observable across languages.

2. This fact can be demonstrated with empirical methods applied to corpora in different languages.

Methodology (1)

Preliminary Study

Two monolingual comparable corpora of original and translated Catalan texts on art and architecture, 300,000 tokens each.

• Corpus Building

– Corpus compilation

– Tokenization, tagging and parsing with CatCG (Alsina, Badia et al. 2002)

• Corpus Exploitation

– Exploitation with WordSmith Tools (wordlists, frequency lists, type-token ratio, lexical density, concordance lists)

– Implementation of scripts to extract collocations and POS n-grams with Python and NLTK

• Implementation of a Machine Learning System

– Machine learning techniques (SVMs) to automatically classify texts as translated or non-translated.

– Training on a subset of the corpus and testing (Weka software).
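As an illustration of the POS n-gram extraction step, here is a minimal Python sketch that uses only the standard library. It assumes the corpus has already been tagged and is available as (word, tag) pairs; the tag labels and the example phrase are hypothetical and do not reproduce the CatCG tagset.

```python
from collections import Counter


def pos_ngrams(tagged_tokens, n):
    """Count n-grams over the POS tags of one (word, tag) sequence."""
    tags = [tag for _, tag in tagged_tokens]
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


def relative_frequencies(counts):
    """Turn raw n-gram counts into proportions of all n-grams counted."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


# Hypothetical tagged phrase: "la casa vella i el pont nou"
tagged = [
    ("la", "DET"), ("casa", "NOUN"), ("vella", "ADJ"), ("i", "CONJ"),
    ("el", "DET"), ("pont", "NOUN"), ("nou", "ADJ"),
]

bigrams = pos_ngrams(tagged, 2)
for gram, freq in sorted(relative_frequencies(bigrams).items()):
    print(gram, round(freq, 3))
```

Relative rather than raw frequencies are used so that profiles from the translated and the original corpus can be compared even if the two differ slightly in size.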

Methodology (2)

Main experiment

• Corpus Building

– Corpus compilation (Spanish, French, English, German)

– Tokenization, tagging and parsing

• Corpus Exploitation

– Exploitation with WordSmith Tools (wordlists, frequency lists, type-token ratio, lexical density, concordance lists)

– Implementation of scripts to extract collocations and POS n-grams with Python and NLTK

• Implementation of a Machine Learning System

– Machine learning techniques (SVMs) to automatically classify texts as translated or non-translated.

– Training on a subset of the corpus and testing (Weka software).
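The classification step can be sketched as follows. This is a hand-written linear SVM trained by subgradient descent on the hinge loss, in plain Python and for illustration only; it is not Weka's implementation. The two-dimensional feature vectors and their labels are invented toy values (think of them as relative frequencies of two hypothetical features), not corpus results.

```python
def score(w, b, x):
    """Signed distance-like score of x under the linear model (w, b)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Fit a linear SVM by subgradient descent on the L2-regularized hinge loss.

    X is a list of feature vectors, y a list of labels in {+1, -1}.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            violated = yi * score(w, b, xi) < 1
            # Regularization step: shrink the weights slightly.
            w = [wj * (1 - lr * lam) for wj in w]
            if violated:
                # Hinge-loss step: move the boundary away from xi.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w, b, x):
    """Return +1 (here: translated) or -1 (here: original)."""
    return 1 if score(w, b, x) >= 0 else -1


# Invented toy training set: +1 = translated, -1 = original.
X_train = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y_train = [1, 1, -1, -1]

w, b = train_linear_svm(X_train, y_train)
print(predict(w, b, [0.95, 0.05]), predict(w, b, [0.05, 0.95]))
```

In a real experiment each text would be represented by many more features, and the model would be evaluated on texts held out from training.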

Working Plan

Commented Bibliography (1)

• Baker, M. (1995). Corpora in Translation Studies: An Overview and Some Suggestions for Future Research. Target 7(2): 223-243.

– Definition of a new type of corpora: monolingual comparable corpora in order to “effect a shift away from comparing either ST with TT or language A with language B to comparing text production per se with translation.”

– Type-token ratio, lexical density measures.

• Borin, L. and Prütz, K. (2001). Through a Glass Darkly: Part-of-speech Distribution in Original and Translated Text, in Computational linguistics in the Netherlands 2000, 30-44.

– Comparison of POS n-grams in order to determine if there are significant syntactic differences between original and translated language.

– Overuse in translated English of preposition-initial sentences and sentence-initial adverbs.

Commented Bibliography (2)

• Kindermann et al. (2003). Authorship attribution with support vector machines. Applied Intelligence 19, 109-123.

– Different statistical techniques for authorship attribution are described: the log-likelihood ratio statistic, naïve Bayesian probabilistic classifiers, multi-layer perceptrons, k-nearest neighbour classification (kNN), Support Vector Machines (SVMs), etc.

– SVMs achieve better results than other classifiers in authorship attribution: they are fast and accept a large number of features as input.


• Baroni, M. and Bernardini, S. (2006). A New Approach to the Study of Translationese: Machine-Learning the Difference between Original and Translated Text. Literary and Linguistic Computing 21(3): 259-274.

– A new explicit criterion to prove the existence of translationese: learnability by a machine.

– SVMs allow the use of a large number of features.

– SVMs achieve better results than professional human translators.

– Their results show that translations are recognizable on purely grammatical/syntactic grounds (distribution of function words and shallow syntactic patterns).


• Tirkkonen-Condit, S. (2002). Translationese – a Myth or an Empirical Fact? Target, 14 (2): 207–20.

– The hypothesis of translationese is, at the very least, controversial, whereas the unique items hypothesis can better describe the translated or non-translated nature of a text.

– Translated texts “manifest lower frequencies of linguistic elements that lack linguistic counterparts in the source languages such that these could also be used as translation equivalents”.
