Corpus design & analysis techniques 1. Monolingual: general, specialized, comparable ...

Corpus design &

analysis

techniques

1

Types of corpora

Monolingual: general, specialized, comparable

Bi/Multilingual: parallel, comparable

2

Type of analysis

Intralingual Interlingual (cross-linguistic)

Number of languages

Monolingual(1 language)

Bilingual/multilingual(2+ languages)

Corpus design (1) MONOLINGUAL1 corpus

(2a) COMPARABLE2+ corpora

(2b) COMPARABLE2+ corpora

(3) PARALLEL2+ corpora

Type typical linguistic corpus

translation-driven corpus

translation-driven corpus translation corpus

Number of languages

1 language 1 language 2 or more languages 2 or more languages

Corpus content

non-translated language A

translated versus non-translated language A

non-translated language A and B

non-translated language A aligned with translation in B

What may be examined

legal language against other genres

translated language against non-translated one

differences and similarities between languages

translation process

3

Monolingual corpora

Monolingual corpus: it is the most typical corpus used by linguists. It contains non-translated texts created only in one language. It involves intralingual analysis, within a single language, for example for descriptive purposes, but also to compare legal language against everyday language or other genres if a reference corpus is used. This type of corpus is mainly used within forensic linguistics, but also in monolingual lexicography and in foreign language teaching to prepare study materials, as is the case with the Cambridge Corpus of Legal English, a 20-million-word collection of legal books and newspaper articles compiled by Cambridge University Press.

4

Comparable corpora

Comparable corpora: It is a set of at least two monolingual corpora which may involve one language (a) or at least two languages (b). Zanettin refers to them as “translation-driven corpora” since their design is motivated by translation research or training yet they do not contain source texts (STs) and corresponding target texts (TTs) (2000: 106). Monolingual comparable corpora: they contain a corpus of

translations and a corpus of texts created spontaneously in the same language (non-translated language). The main object of analysis is how the translated language differs from the non-translated language (to be discussed later as the ‘textual fit’). An example of such corpora is the Translational English Corpus at the University of Manchester. This type of corpora is used in translation studies.

Bilingual or multilingual comparable corpora: they do not contain translated language but spontaneously created texts in two different languages. It is a set of two monolingual corpora designed according to a similar criterion and is used for cross-linguistic analysis. In addition to translation studies, this type of corpora is typically associated with contrastive and comparable linguistics. An example of comparable corpora is the BOnonia Legal Corpus, BoLC, at the University of Bologna, with the Italian legal subcorpus of 33.5 m words and the English legal corpus of 21 m words.

5

Parallel corpora Parallel corpus is a translation corpus in the strictest

sense. It is bilingual or multilingual and may be bi-directional. It contains STs aligned with their translations. Alignment makes parallel corpora more time-consuming to build and, as a result, they are rather seldom found. Examples include: the MultiJur Multilingual Corpus of Legal Texts at the University of Helsinki, legal sections of the CLUVI Parallel Corpus at the University of Vigo (Galician-Spanish, Basque-Spanish) and the GENTT Corpus of Textual Genres for Translation at the Jaume I University. This type of corpus is mainly used for research into the translation process and in applied translation studies: to prepare dictionaries, extract terms for terminological databases, train information extraction software, and train translators.

6

Examples of corpora Narodowy Korpus Języka Polskiego

http://nkjp.pl/index.php?page=6&lang=0 Korpus Języka Polskiego PWN http://korpus.pwn.pl/

(pełny bezpłatny dostęp w BUG Oliwa) Korpus Języka Polskiego IPI PAN

http://korpus.pl/index.php?page=poliqarp British National Corpus:

http://www.natcorp.ox.ac.uk/ (100 million word collection, spoken 10%, written 90%)

Proceedings of the Old Bailey, London's Central Criminal Court 1674-1913 http://www.oldbaileyonline.org/

Korpus równolegly JRC-Acquis Multilingual Parallel Corpus

http://langtech.jrc.it/JRC-Acquis.html - korpus PL ok. 30 mln slow

7

http://nkjp.pl/index.php?page=6&lang=0

http://korpus.pwn.pl/

http://korpus.pl/index.php?page=poliqarp

http://www.natcorp.ox.ac.uk/

http://www.oldbaileyonline.org/

http://langtech.jrc.it/JRC-Acquis.html

Corpus analysis software

CORPUS SOFTWARE Opis roznych programow

http://www.uow.edu.au/~dlee/software.htm Monolingual Comparison of KfNgram, N-Gram Phrase Extractor,

Wordsmith: http://llt.msu.edu/vol10num1/review3/default.html

Wordsmith: http://www.lexically.net/wordsmith/ KfNgram:

http://kwicfinder.com/kfNgram/kfNgramHelp.html Lexical Tutor/N-Gram: http://www.lextutor.ca/ Corsis (open-access answer to Wordsmith)

http://sourceforge.net/projects/corsis

8

http://www.uow.edu.au/~dlee/software.htm

http://llt.msu.edu/vol10num1/review3/default.html

http://www.lexically.net/wordsmith/

http://kwicfinder.com/kfNgram/kfNgramHelp.html

http://www.lextutor.ca/

http://sourceforge.net/projects/corsis/

Principles of corpus design

Purpose Balance Representativeness

Sampling criteria Language variety Time span Full text / extracts Sample size Target audience Overall size Translators represented (e.g. acc. to sociolinguistic

variables: gender, mother tongue)

9

Corpus Analysis Techniques

Wordlists Alphabetical lists Frequency-ranked lists Keywords Lists of clusters

KWIC Concordance Collocates

Statistics: average sentence/word length; type/token rate; lexical denisty

10

Wordlist

What is the purpose of preparing a wordlist?- Make a wordlist & analyse lexical v function

words- Make a batch

Statistics:- Average sentence/word length- Type/token ratio: If a text is 1,000 words long, it

is said to have 1,000 tokens. But a lot of these words will be repeated, and there may be only say 400 different words in the text. ‘Types’ are different words. The ratio between types and tokens would be here 40%.

11

Clusters / lexical bundlesClusters – words which are found

repeatedly together in sequence; recurrent expressions regardless of their idiomacity, and regardless of their structural status N-grams, p-frames (phrase-frames), lexical

bundles, multi-word-units, conversational routines, fixed expressions

4-grams: I don’t think so, I don’t think I, but I don’t think

12

Keywords

Keyword – word that is found to be outstanding in its frequency in a text with reference to its frequency in another, generally larger, text/corpus of texts

Key words are lexemes which have become cognitively salient through their repetitive, unusually frequent use. They characterise a given text in that they “are used over and over in the text and are crucial to the theme or topic under discussion. (...) Key words are most often words which represent an essential or basic concept of the text” (Larson 1984: 177).

13

KWIC concordance

A list of all the occurrences of a specified word or expression in a corpus, set in the middle of one line of context each. KWIC concordances help identify collocates

Collocates – words which occur in the neighbourhood (co-text) of the search word

14

Semantic prosodySemantic prosody refers to the positive

or negative connotative meaning which is transferred to the focus word by the semantic fields of its common collocates (Louw 1993). Stubbs (1995, 1996:173–4) examines collocates of causal verbs and finds in his corpus that the vast majority of collocates of cause are negative, e.g. accident, cancer, commotion, crisis and delay. On the other hand, the verb provide has a positive semantic prosody with collocates care, food, help, jobs, relief and support.

15

Corpus design & analysis techniques 1. Monolingual: general, specialized, comparable ...

Documents

Transcript of Corpus design & analysis techniques 1. Monolingual: general, specialized, comparable ...