Corpus design & analysis techniques 1. Monolingual: general, specialized, comparable ...
-
Upload
alicia-lewis -
Category
Documents
-
view
213 -
download
1
Transcript of Corpus design & analysis techniques 1. Monolingual: general, specialized, comparable ...
Corpus design &
analysis
techniques
1
Types of corpora
Monolingual: general, specialized, comparable
Bi/Multilingual: parallel, comparable
2
Type of analysis
Intralingual Interlingual (cross-linguistic)
Number of languages
Monolingual(1 language)
Bilingual/multilingual(2+ languages)
Corpus design (1) MONOLINGUAL1 corpus
(2a) COMPARABLE2+ corpora
(2b) COMPARABLE2+ corpora
(3) PARALLEL2+ corpora
Type typical linguistic corpus
translation-driven corpus
translation-driven corpus translation corpus
Number of languages
1 language 1 language 2 or more languages 2 or more languages
Corpus content
non-translated language A
translated versus non-translated language A
non-translated language A and B
non-translated language A aligned with translation in B
What may be examined
legal language against other genres
translated language against non-translated one
differences and similarities between languages
translation process
3
Monolingual corpora
Monolingual corpus: it is the most typical corpus used by linguists. It contains non-translated texts created only in one language. It involves intralingual analysis, within a single language, for example for descriptive purposes, but also to compare legal language against everyday language or other genres if a reference corpus is used. This type of corpus is mainly used within forensic linguistics, but also in monolingual lexicography and in foreign language teaching to prepare study materials, as is the case with the Cambridge Corpus of Legal English, a 20-million-word collection of legal books and newspaper articles compiled by Cambridge University Press.
4
Comparable corpora
Comparable corpora: It is a set of at least two monolingual corpora which may involve one language (a) or at least two languages (b). Zanettin refers to them as “translation-driven corpora” since their design is motivated by translation research or training yet they do not contain source texts (STs) and corresponding target texts (TTs) (2000: 106). Monolingual comparable corpora: they contain a corpus of
translations and a corpus of texts created spontaneously in the same language (non-translated language). The main object of analysis is how the translated language differs from the non-translated language (to be discussed later as the ‘textual fit’). An example of such corpora is the Translational English Corpus at the University of Manchester. This type of corpora is used in translation studies.
Bilingual or multilingual comparable corpora: they do not contain translated language but spontaneously created texts in two different languages. It is a set of two monolingual corpora designed according to a similar criterion and is used for cross-linguistic analysis. In addition to translation studies, this type of corpora is typically associated with contrastive and comparable linguistics. An example of comparable corpora is the BOnonia Legal Corpus, BoLC, at the University of Bologna, with the Italian legal subcorpus of 33.5 m words and the English legal corpus of 21 m words.
5
Parallel corpora Parallel corpus is a translation corpus in the strictest
sense. It is bilingual or multilingual and may be bi-directional. It contains STs aligned with their translations. Alignment makes parallel corpora more time-consuming to build and, as a result, they are rather seldom found. Examples include: the MultiJur Multilingual Corpus of Legal Texts at the University of Helsinki, legal sections of the CLUVI Parallel Corpus at the University of Vigo (Galician-Spanish, Basque-Spanish) and the GENTT Corpus of Textual Genres for Translation at the Jaume I University. This type of corpus is mainly used for research into the translation process and in applied translation studies: to prepare dictionaries, extract terms for terminological databases, train information extraction software, and train translators.
6
Examples of corpora Narodowy Korpus Języka Polskiego
http://nkjp.pl/index.php?page=6&lang=0 Korpus Języka Polskiego PWN http://korpus.pwn.pl/
(pełny bezpłatny dostęp w BUG Oliwa) Korpus Języka Polskiego IPI PAN
http://korpus.pl/index.php?page=poliqarp British National Corpus:
http://www.natcorp.ox.ac.uk/ (100 million word collection, spoken 10%, written 90%)
Proceedings of the Old Bailey, London's Central Criminal Court 1674-1913 http://www.oldbaileyonline.org/
Korpus równolegly JRC-Acquis Multilingual Parallel Corpus
http://langtech.jrc.it/JRC-Acquis.html - korpus PL ok. 30 mln slow
7
Corpus analysis software
CORPUS SOFTWARE Opis roznych programow
http://www.uow.edu.au/~dlee/software.htm Monolingual Comparison of KfNgram, N-Gram Phrase Extractor,
Wordsmith: http://llt.msu.edu/vol10num1/review3/default.html
Wordsmith: http://www.lexically.net/wordsmith/ KfNgram:
http://kwicfinder.com/kfNgram/kfNgramHelp.html Lexical Tutor/N-Gram: http://www.lextutor.ca/ Corsis (open-access answer to Wordsmith)
http://sourceforge.net/projects/corsis
8
Principles of corpus design
Purpose Balance Representativeness
Sampling criteria Language variety Time span Full text / extracts Sample size Target audience Overall size Translators represented (e.g. acc. to sociolinguistic
variables: gender, mother tongue)
9
Corpus Analysis Techniques
Wordlists Alphabetical lists Frequency-ranked lists Keywords Lists of clusters
KWIC Concordance Collocates
Statistics: average sentence/word length; type/token rate; lexical denisty
10
Wordlist
What is the purpose of preparing a wordlist?- Make a wordlist & analyse lexical v function
words- Make a batch
Statistics:- Average sentence/word length- Type/token ratio: If a text is 1,000 words long, it
is said to have 1,000 tokens. But a lot of these words will be repeated, and there may be only say 400 different words in the text. ‘Types’ are different words. The ratio between types and tokens would be here 40%.
11
Clusters / lexical bundlesClusters – words which are found
repeatedly together in sequence; recurrent expressions regardless of their idiomacity, and regardless of their structural status N-grams, p-frames (phrase-frames), lexical
bundles, multi-word-units, conversational routines, fixed expressions
4-grams: I don’t think so, I don’t think I, but I don’t think
12
Keywords
Keyword – word that is found to be outstanding in its frequency in a text with reference to its frequency in another, generally larger, text/corpus of texts
Key words are lexemes which have become cognitively salient through their repetitive, unusually frequent use. They characterise a given text in that they “are used over and over in the text and are crucial to the theme or topic under discussion. (...) Key words are most often words which represent an essential or basic concept of the text” (Larson 1984: 177).
13
KWIC concordance
A list of all the occurrences of a specified word or expression in a corpus, set in the middle of one line of context each. KWIC concordances help identify collocates
Collocates – words which occur in the neighbourhood (co-text) of the search word
14
Semantic prosodySemantic prosody refers to the positive
or negative connotative meaning which is transferred to the focus word by the semantic fields of its common collocates (Louw 1993). Stubbs (1995, 1996:173–4) examines collocates of causal verbs and finds in his corpus that the vast majority of collocates of cause are negative, e.g. accident, cancer, commotion, crisis and delay. On the other hand, the verb provide has a positive semantic prosody with collocates care, food, help, jobs, relief and support.
15