Corpus Linguistics Notes

A series of notes on corpus linguistics.

Page 3: Corpus Linguistics Notes

CASS: Corpus Approaches to Social Science

Page 4: Corpus Linguistics Notes

Using comparable and parallel corpora in contrastive and translation studies

Page 5: Corpus Linguistics Notes

Richard Xiao, Lancaster University

Page 6: Corpus Linguistics Notes

Outline of the session

•  Types of corpora used in translation and contrastive studies

•  Paradigmatic shift in contrastive and translation studies

•  A model of Contrastive Corpus Linguistics

•  Alignment and parallel concordancing

•  Corpus resources and tools

Page 7: Corpus Linguistics Notes

Types of corpora: Some distinctions

•  Monolingual versus multilingual corpora

•  Parallel versus comparable corpora

•  Comparable versus comparative corpora

Page 8: Corpus Linguistics Notes

Monolingual vs. multilingual corpora

•  Monolingual corpora

    •  A corpus that only involves one language

•  Multilingual corpora

    •  A corpus that contains texts of more than one language

    •  A corpus covering two languages is conventionally known as ‘bilingual’

    •  Multilingual corpora, in a narrow sense, must involve more than two languages

    •  ‘Multilingual’ and ‘bilingual’ are often used interchangeably

    •  Parallel and comparable corpora

Page 9: Corpus Linguistics Notes

Parallel vs. comparable corpora

•  Terminological confusion centres around the terms

•  For some scholars (e.g. Aijmer & Altenberg 1996; Granger 1996: 38)

    •  Corpora composed of source texts in one language and their translations in another language (or other languages) are ‘translation corpora’ while those comprising different components sampled from different native languages using comparable sampling techniques are called ‘parallel corpora’

•  For many others (e.g. Baker 1993: 248, 1995, 1999; Barlow 1995, 2000: 110; Hunston 2002: 15; McEnery and Wilson 1996: 57; McEnery, Xiao & Tono 2006)

    •  Corpora of the first type are labelled ‘parallel corpora’ while those of the latter type are ‘comparable corpora’

Page 10: Corpus Linguistics Notes

Parallel vs. comparable corpora

•  Consistent and logical ways of doing things…

•  We can say a corpus is a translation or a non-translation corpus if the criterion of corpus content is used

•  But if we choose to define corpus types by the criterion of corpus form, we must use the criterion consistently

    •  We can say a corpus is parallel if the corpus contains source texts and translations in parallel, or it is a comparable corpus if its components or subcorpora are comparable by applying the same sampling techniques and representing similar balance

•  It is simply inconsistent and illogical to refer to corpora of the first type as ‘translation corpora’ by the criterion of content while referring to corpora of the latter type as ‘comparable corpora’ by the criterion of form!

Page 11: Corpus Linguistics Notes

Multilingual vs. monolingual comparable corpora

•  A common practice in Translation Studies (TS) is to compare a corpus of translated texts (‘translational corpus’) with a corpus comprising comparably sampled non-translated native texts in the target language

    •  The ZJU Corpus of Translational Chinese (ZCTC) vs. the Lancaster Corpus of Mandarin Chinese (LCMC)

•  The two sub-corpora form monolingual comparable corpora, as opposed to multilingual comparable corpora composed of comparable texts for different languages (LCMC vs. FLOB)

Page 12: Corpus Linguistics Notes

Comparative corpora

•  Corpora containing different varieties of the same language are not comparable corpora

    •  e.g. the International Corpus of English (ICE); the Brown family of corpora

•  All corpora, as a resource for linguistic research, are well suited for comparative studies, in either intralingual or interlingual research

•  Corpora of this kind are comparative corpora

Page 13: Corpus Linguistics Notes

Use of parallel & comparable corpora

•  Parallel and comparable corpora “offer specific uses and possibilities” for contrastive and translation studies (Aijmer & Altenberg 1996:12)

    •  Giving new insights into the languages compared – insights that are not likely to be gained from the study of monolingual corpora

    •  Used for a range of comparative purposes and increasing our knowledge of language-specific, typological and cultural differences, as well as of universal features

    •  Illuminating differences between source texts and translations, and between native and non-native texts

    •  Used for a number of practical applications, e.g. in lexicography, language teaching and translation

Page 14: Corpus Linguistics Notes

Use of parallel & comparable corpora

•  Used primarily for translation and contrastive studies

•  The two types of corpora have their own characteristics, and serve different purposes

    •  Parallel corpora: useful in translation studies, but they alone serve as a poor basis for cross-linguistic contrast, because translations cannot avoid the effect of translationese

    •  Comparable corpora: well suited for contrastive research, but are less useful in translation studies, e.g. in studying translation equivalents

Page 15: Corpus Linguistics Notes

Using corpora in translation studies

•  Translational corpora

    •  Used in combination with a comparable target-language (TL) corpus to provide primary evidence in product-oriented Translation Studies, and in studies of “translation universals”

    •  If corpora of this kind are encoded with sociolinguistic and cultural parameters, they can also be used to study the sociocultural environment of translations

•  Monolingual source-language (SL) and TL corpora

    •  Can raise the translator’s linguistic and cultural awareness in general

    •  A useful and effective reference tool for translators

    •  Used in combination with a parallel corpus to form a so-called ‘translation evaluation corpus’: helping translator trainers or critics to evaluate translations more effectively and objectively

Page 16: Corpus Linguistics Notes

Using corpora in translation studies

•  Parallel corpora

    •  Useful in exploring how an idea in one language is conveyed in another language, thus providing indirect evidence to the study of the translation process

    •  Indispensable for building statistical or example-based machine translation (EBMT) systems, and for the development of bilingual lexicons and translation memories

    •  Parallel concordancing is a useful tool for translators

•  Comparable corpora of SL and TL

    •  Useful in improving the translator’s understanding of the subject field and improving the quality of translation in terms of fluency, correct term choice and idiomatic expressions in the chosen field

    •  Can also be used to build terminology banks

Page 17: Corpus Linguistics Notes

Corpora in contrastive linguistics

•  Contrastive analysis

    •  Became an important part of foreign language teaching (FLT) methodology after WWII and remained dominant throughout the 1960s

    •  Lost ground to more learner-oriented approaches, e.g. error analysis, performance analysis, and interlanguage analysis

    •  Revived in the 1990s

    •  The rapid development of corpus linguistics has been recognized as a principal reason for its revival (cf. Salkie 2002; Xiao & McEnery 2010; Xiao 2011)

Page 18: Corpus Linguistics Notes

Corpora in contrastive linguistics

•  The marriage of corpus linguistics and contrastive analysis is an entirely natural one

    •  Corpus linguistics is inherently comparative in nature

    •  The combination of corpus analysis and contrastive analysis produces a synergy that can benefit, and has benefited, both corpus linguistics and contrastive analysis

•  Corpora have “always been pre-eminently suited for comparative studies” (Aarts 1998:ix)

    •  Corpora of the Brown family (Lancaster 1931, LOB, FLOB, BE2006; B-Brown, Brown, Frown, AE2006)

    •  Even the BNC, which is designed as a balanced corpus representing modern British English in general, provides a useful basis for various intra-lingual comparisons

Page 19: Corpus Linguistics Notes

Corpora in contrastive linguistics

•  Corpus analysis techniques are also intrinsically comparative

    •  keyword analysis

    •  collocation analysis

    •  interlanguage analysis

•  Corpus-based contrastive linguistics has emerged with a wealth of methodologies, addressing a wide spectrum of cross-linguistic issues (cf. Altenberg & Granger 2002; Granger 2003)

Page 20: Corpus Linguistics Notes

Corpus-based Translation Studies

•  Laviosa (1998a): “the corpus-based approach is evolving, through theoretical elaboration and empirical realisation, into a coherent, composite and rich paradigm that addresses a variety of issues pertaining to theory, description, and the practice of translation.”

•  Hypothesis that translation universals can be identified and tested by using corpus data (Baker 1993, 1995)

•  Rapid development of corpus linguistics, especially multilingual corpus research in the early 1990s

•  Increasing interest in Descriptive Translation Studies (Toury 1995)

Page 21: Corpus Linguistics Notes

Corpus-based Translation Studies

•  Tymoczko (1998): “Corpus Translation Studies is central to the way that Translation Studies as a discipline will remain vital and move forward.”

•  Meta 43/4 (1998); Kenny (2001); Bowker (2002); Laviosa (2002); Granger et al (2003); Teich (2003); Zanettin et al (2003); Mauranen et al (2004); Olohan (2004); Santos (2004); Rogers & Anderman (2007); Beeby et al (2009); Saldanha (2009); Hruzov (2010); Izwaini (2010); Tengku et al (2010); Véronis (2010); Xiao (2010, 2011, 2012); Hu (2011); Kruger et al (2011); Wang (2012)

•  Corpus-based Translation Studies book series (Shanghai Jiao Tong University Press / Springer)

Page 22: Corpus Linguistics Notes

The Holmes-Toury map

•  Applied Translation Studies

•  Descriptive Translation Studies

•  Theoretical Translation Studies

Page 23: Corpus Linguistics Notes

Applied Translation Studies

•  Three major contributions of corpora

    •  Corpus-assisted translating

        •  Bowker (1998: 631): “corpus-assisted translations are of a higher quality with respect to subject field understanding, correct term choice and idiomatic expressions.”

    •  Corpus-aided translation teaching and training

        •  Bernardini (1997): ‘large corpora concordancing’ (LCC) can help students to develop ‘awareness’, ‘reflectiveness’ and ‘resourcefulness’, the skills that distinguish a professional translator from an unskilled amateur

    •  Development of translation tools

        •  Corpora, and especially aligned parallel corpora, are essential for the development of translation technology such as machine translation (MT) systems, computer-aided translation (CAT) tools, and translation memories (TM)

Page 24: Corpus Linguistics Notes

Descriptive Translation Studies

•  Characterized by its emphasis on the study of translation per se, aiming to answer the question of “why a translator translates in this way” instead of “how to translate”

•  Baker (1993) predicted that the availability of large corpora of both source and translated texts, together with the development of the corpus-based approach, would enable translation scholars to uncover the nature of translation as a mediated communicative event

Page 25: Corpus Linguistics Notes

Descriptive Translation Studies

•  Three focuses (Holmes 1972/1988)

    •  The function of translation

        •  Concerned with the study of contexts rather than texts: e.g. function or impact of a translation work

        •  Relatively few function-oriented studies that are corpus-based

    •  Translation as a process

        •  Aiming to reveal the thought processes that take place in the mind of the translator while they are translating

        •  One possible way for corpus-based DTS is to analyze, off-line, the written transcripts of translators' recorded verbalizations (Think-Aloud Protocols, or TAPs)

        •  Research on translation as a product can also provide indirect evidence about translation as a process (product vs. process)

    •  Translation as a product

        •  Concerned with describing translation as a product by comparing comparable corpora of translated and non-translated texts in TL

        •  Attempting to uncover evidence to validate / invalidate the so-called translation universal hypotheses

Page 26: Corpus Linguistics Notes

Descriptive Translation Studies

•  Core patterns of lexical use (Laviosa 1998b)

•  A relatively low proportion of lexical words over function words

•  A relatively high proportion of high-frequency words over low-frequency words

•  A relatively great repetition of the most frequent words

•  Less variety in most frequently used words
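The lexical patterns listed above can be approximated with simple counts. The following is a minimal sketch, not Laviosa's actual procedure: the tokenizer, the tiny `FUNCTION_WORDS` list and the function name `lexical_profile` are all illustrative assumptions, and a real study would use a proper function-word list or POS tagging. Running it on a translated corpus and on a comparable non-translated corpus would give figures that can be compared.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (illustration only).
FUNCTION_WORDS = {
    "the", "a", "an", "of", "in", "on", "to", "and", "or", "but",
    "is", "are", "was", "were", "it", "that", "this", "with", "for",
    "as", "by", "at",
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexical_profile(text, top_n=3):
    """Return rough measures related to the four lexical patterns."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no word tokens")
    counts = Counter(tokens)
    lexical = [t for t in tokens if t not in FUNCTION_WORDS]
    top = counts.most_common(top_n)
    return {
        "tokens": len(tokens),
        # Share of lexical (content) words among all tokens.
        "lexical_density": len(lexical) / len(tokens),
        # Share of all tokens covered by the top_n most frequent words.
        "top_n_coverage": sum(c for _, c in top) / len(tokens),
        # Number of distinct word forms per token (lexical variety).
        "type_token_ratio": len(counts) / len(tokens),
    }
```

Lower lexical density, higher coverage by the most frequent words and a lower type-token ratio in the translated corpus would be in line with the patterns on this slide, provided the two corpora are of similar size, since the type-token ratio is sensitive to text length.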

Page 27: Corpus Linguistics Notes

Descriptive Translation Studies

•  Beyond the lexical level

•  Simplification: “tendency to simplify the language used in translation” (Baker 1996: 181-182)

•  Normalisation: “tendency to exaggerate features of the target language and to conform to its typical patterns” (Baker 1996: 183)

•  Explicitation: translations tend to “spell things out rather than leave them implicit” (Baker 1996: 180)

•  Sanitisation: translated texts are “somewhat ‘sanitised’ versions of the original” (Kenny 1998: 515)

•  Leveling out (convergence): “tendency of translated text to gravitate towards the centre of a continuum” (Baker 1996: 184)

Page 28: Corpus Linguistics Notes

Theoretical Translation Studies

•  Aims “to establish general principles by means of which these phenomena can be explained and predicted” (Holmes 1988: 71)

•  Closely related to, and often reliant on, the empirical findings produced by Descriptive Translation Studies

•  One good battleground for using DTS findings in pursuit of a general theory of translation is the hypothesis of so-called “translation universals” (TUs) – the inherent common features of translational language

    •  An important area of corpus-based TS over the past decade

Page 29: Corpus Linguistics Notes

Contrastive Corpus Linguistics

•  Bringing together the strengths of contrastive analysis and corpus analysis

•  This synergy has not only revived contrastive analysis but has also expanded the fields of corpus linguistics, translation studies, and second language acquisition (SLA) research

•  A new model of Contrastive Corpus Linguistics (Xiao & McEnery 2010) to demonstrate the promise and potential value of the corpus-based approach to contrastive and translation studies

    •  Common platform for research areas including corpus linguistics, contrastive linguistics, translation studies, and SLA

Page 30: Corpus Linguistics Notes

Contrastive Corpus Linguistics

Page 31: Corpus Linguistics Notes

Corpus alignment

•  We have so far assumed that parallel corpora mean aligned parallel corpora

•  Alignment is an essential step in the construction and exploitation of parallel corpora

•  Without alignment, we cannot easily determine which sentences in TL are translations of which in SL

•  Corpus alignment makes explicit the information regarding the translation in a parallel corpus, with the aim of finding translation equivalents at different levels (sentence, phrase, word) between the SL and TL texts in a parallel corpus

•  Most multilingual corpus tools only take pre-aligned parallel texts as input in parallel concordancing

Page 32: Corpus Linguistics Notes

Corpus alignment

•  Levels of alignment

    •  Document level

    •  Paragraph

    •  Sentence

    •  Phrase (multi-word unit)

    •  Word

•  Sentence alignment is generally the first step to phrase and word alignment

Page 33: Corpus Linguistics Notes

Corpus alignment

•  Combined vs. stand-alone format

    •  Combined/embedded: the source and translated texts are stored in a single file

    •  Stand-alone: the texts are stored in separate files, with the SL and TL segments of each translation equivalent linked by a unique identifier or pointer

•  Conversion between the two formats is possible

•  Different parallel concordancers may have different requirements
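A minimal sketch of converting between the two storage formats is given below. The concrete layouts are assumptions made for illustration, not the input format of any particular concordancer: the combined format is taken to be one tab-separated source/target pair per line, and the stand-alone format is two texts whose lines start with a shared segment identifier (`s1`, `s2`, ...). The function names are hypothetical.

```python
def to_standalone(combined_text, sep="\t"):
    """Split combined 'source<TAB>target' lines into two ID-linked texts."""
    src_lines, tgt_lines = [], []
    seg_no = 0
    for line in combined_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        src, tgt = line.split(sep, 1)
        seg_no += 1
        seg_id = f"s{seg_no}"  # identifier shared by both sides
        src_lines.append(f"{seg_id}{sep}{src}")
        tgt_lines.append(f"{seg_id}{sep}{tgt}")
    return "\n".join(src_lines), "\n".join(tgt_lines)


def to_combined(src_text, tgt_text, sep="\t"):
    """Join two ID-linked texts back into 'source<TAB>target' lines."""
    def index(text):
        # Map each segment identifier to its segment text.
        return dict(
            line.split(sep, 1) for line in text.splitlines() if line.strip()
        )

    src, tgt = index(src_text), index(tgt_text)
    # Keep source order; segments without a partner are dropped.
    return "\n".join(f"{src[k]}{sep}{tgt[k]}" for k in src if k in tgt)
```

Because the link between segments is carried by the identifier rather than by line position, the stand-alone files can be edited or reordered independently, which is one reason some tools prefer that layout.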

Page 34: Corpus Linguistics Notes

Corpus alignment

•  Statistical (probabilistic) approach to sentence alignment

    •  Usually based on sentence length in terms of words or characters

•  Linguistic (knowledge/rule-based) approach

    •  Using morpho-syntactic information to explore similarities between languages

    •  Punctuation and “anchor points”

    •  Achieving more accurate alignment, but necessarily slow

•  Hybrid approach

    •  Most widely used approach to sentence alignment

    •  Integrating linguistic knowledge into a probabilistic algorithm to achieve improved accuracy

    •  Making use of anchor points
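The length-based idea behind the statistical approach can be sketched with dynamic programming. This is a simplified illustration under stated assumptions, not the algorithm used by any of the tools mentioned in these notes: it replaces a probabilistic model with a plain relative-length-difference cost, uses character counts only, and the penalty values and the name `align_by_length` are arbitrary choices for the example.

```python
def align_by_length(src, tgt, skip_penalty=1.0, merge_penalty=0.3):
    """Align two lists of sentences using character lengths only.

    Returns a list of (source_indices, target_indices) pairs ("beads").
    """
    # Allowed bead shapes: (source sentences, target sentences, penalty).
    shapes = [(1, 1, 0.0), (2, 1, merge_penalty), (1, 2, merge_penalty),
              (1, 0, skip_penalty), (0, 1, skip_penalty)]

    def length_cost(s_chunk, t_chunk):
        # Relative difference in total character length (0 = same length).
        ls = sum(len(s) for s in s_chunk)
        lt = sum(len(t) for t in t_chunk)
        return abs(ls - lt) / max(ls, lt, 1)

    n, m = len(src), len(tgt)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for di, dj, penalty in shapes:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                cost = best[i][j] + penalty
                if di and dj:  # skipped sentences pay the penalty only
                    cost += length_cost(src[i:ni], tgt[j:nj])
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (i, j)

    # Walk back from the end to recover the cheapest sequence of beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

A hybrid aligner in the sense of the slide would add linguistic evidence to this cost, for example by lowering the cost of a bead whose two sides share an anchor point such as a number, a name or a punctuation pattern.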

Page 35: Corpus Linguistics Notes

Corpus alignment

•  Research on alignment has focused on European language pairs

•  Sentence alignment among closely related European language pairs has achieved a very high accuracy rate (98%+)

•  But less accurate for typologically different languages such as English and Chinese (ca. 80%+), typically requiring human intervention or post-editing

Page 36: Corpus Linguistics Notes

Corpus alignment

•  InterText Editor (with automatic Hunalign)

    •  Supporting different operating systems

    •  Local and networked server

    •  http://wanthalf.saga.cz/intertext

•  WinAlign in SDL-Trados

    •  Commercial CAT software tool

•  Uplug corpus tools

    •  http://sourceforge.net/projects/uplug/?source=dlp

Page 39: Corpus Linguistics Notes

Parallel concordancing

•  ParaConc

    •  Commercial software (US$89): http://www.paraconc.com/

    •  Unicode compliant

    •  Semi-automatic alignment

    •  Computing and highlighting collocation

    •  Supporting 2-4 aligned parallel texts stored in separate files

Page 41: Corpus Linguistics Notes

Parallel concordancing

•  CUC_Paraconc

    •  Freeware tool

    •  Supporting up to 16 parallel texts stored either in one file or in different files

    •  Unicode compliant

    •  Supporting Regular Expression search

    •  Displaying results in KWIC format, and saving results either in a single text file or in different files

    •  www.fass.lancs.ac.uk/projects/corpus/data/CUC_Paraconc.zip
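What a parallel concordancer does with pre-aligned text can be sketched in a few lines. This is an illustrative approximation and not the behaviour of ParaConc or CUC_Paraconc: it assumes the corpus is already available as a list of (source, target) sentence pairs, searches the source side with a regular expression, and returns each hit in KWIC form together with its aligned translation. The function name `parallel_kwic` and the sample sentences are invented for the example.

```python
import re


def parallel_kwic(pairs, pattern, width=30):
    """Search the source side of aligned pairs and return KWIC hits.

    Each hit is (left_context, keyword, right_context, aligned_target).
    """
    rx = re.compile(pattern)
    hits = []
    for src, tgt in pairs:
        for m in rx.finditer(src):
            left = src[max(0, m.start() - width):m.start()]
            right = src[m.end():m.end() + width]
            hits.append((left, m.group(0), right, tgt))
    return hits


if __name__ == "__main__":
    # Invented English-Chinese sample pairs, for illustration only.
    sample = [
        ("I have seen the film.", "我看过这部电影。"),
        ("She has seen it twice.", "她看过两次。"),
        ("We went home.", "我们回家了。"),
    ]
    for left, kw, right, tgt in parallel_kwic(sample, r"\b(?:has|have) seen\b"):
        # Right-align the left context so the keywords line up in a column.
        print(f"{left:>30} [{kw}] {right:<30} || {tgt}")
```

Searching the target side instead only requires swapping the two elements of each pair before calling the function.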

Page 45: Corpus Linguistics Notes

Parallel concordancing

•  Terminology in multilingual corpus linguistics

•  Types of corpora used in contrastive and translation studies

•  Relationship between corpus linguistics and contrastive analysis

•  Corpus-based translation studies

•  Corpus alignment and parallel concordancing

•  Well known and influential corpora

    •  www.fass.lancs.ac.uk/projects/corpus/cbls/corpus_survey.pdf

Page 46: Corpus Linguistics Notes

UCCTS conferences

•  International conferences on Using Corpora in Contrastive and Translation Studies

•  UCCTS1: China

    •  www.lancs.ac.uk/fass/projects/corpus/UCCTS2008Proceedings

•  UCCTS2: UK

    •  www.lancs.ac.uk/fass/projects/corpus/UCCTS2010Proceedings

•  UCCTS3 (jointly with ICLC7): Belgium

    •  http://www.iclc7-uccts3.ugent.be/

•  UCCTS4: July 2014, Lancaster

    •  http://ucrel.lancs.ac.uk/uccts4/
