Types of Corpora


  • 7/27/2019 Types of Corpora


    MODELOS GRAMATICALES DEL INGLÉS. Juan Santana Lario

    Tel. 958 241000, Ext. 20243. Fax 958 243678.

    [email protected] www.ugr.es/local/jsantana


    MODELOS GRAMATICALES. Corpus Linguistics

    2. Types of corpora

    According to purpose:
    o General-purpose corpora: designed as a resource for a general representation of the language and to serve as the basis for a wide range of linguistic studies. E.g.: Brown, LOB (Lancaster-Oslo/Bergen Corpus), BNC (British National Corpus).
    o Domain-specific (or sub-language) corpora: represent a specific variety (regional, temporal, language domain, etc.) and/or are intended for specific purposes (language teaching, dictionary making, translation studies, etc.). E.g.: Guangzhou Petroleum English Corpus, JDEST Computer Corpus of Text in English for Science and Technology.

    According to text selection procedure:
    o Sample corpus: consists of sections of texts (samples) of approximately the same length, representing a variety of text categories (balancing, representativeness). E.g.: Brown, LOB (Lancaster-Oslo/Bergen Corpus), SEU (Survey of English Usage corpus). Brown and LOB: 15 text categories, 500 samples, 2,000 words per sample.
    o Full-text corpus: consists of full texts. E.g.: English Poetry Full-Text Database.
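The Brown and LOB design figures above imply a corpus of roughly one million words (500 samples of 2,000 words each). The following is a minimal Python sketch of that arithmetic and of cutting a fixed-length sample out of a longer text. The function name `take_sample` and the whitespace tokenisation are illustrative assumptions, not the procedure the Brown or LOB compilers actually followed.

```python
def take_sample(text, size=2000, start=0):
    """Return `size` whitespace-separated tokens of `text`, starting at token `start`."""
    tokens = text.split()
    return tokens[start:start + size]


# Design figures quoted above for Brown and LOB.
SAMPLES = 500
WORDS_PER_SAMPLE = 2000
total_words = SAMPLES * WORDS_PER_SAMPLE  # about one million words

# Toy "full text" of 5,000 tokens, used only to exercise take_sample.
toy_text = " ".join(f"w{i}" for i in range(5000))
sample = take_sample(toy_text, size=WORDS_PER_SAMPLE, start=1000)
```

A full-text corpus would simply keep `toy_text` whole, while a sample corpus keeps only the 2,000-token extract.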

    Open / closed character:
    o Closed/static corpus: once the corpus is completed, no more texts are added. E.g.: all the corpora above.
    o Open/dynamic corpus, monitor corpus or textbank: new materials are continually added and older materials are discarded, while the balance between different types is maintained. E.g.: Bank of English (University of Birmingham), originally compiled to produce the CoBuild Dictionary.
    o Collections: not exactly corpora (they lack an explicit design/purpose) but large sets of texts. E.g.: Oxford Text Archive, LDC (Linguistic Data Consortium), Project Gutenberg.
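The open/dynamic idea (new material in, oldest material out, balance preserved) can be sketched as a toy data structure. The class `MonitorCorpus` and its fixed per-category cap are hypothetical choices made for illustration. They are one simple way to keep categories balanced, and not a description of how the Bank of English is maintained.

```python
from collections import deque


class MonitorCorpus:
    """Toy open/dynamic corpus: a fixed number of texts is kept per category."""

    def __init__(self, texts_per_category):
        self.texts_per_category = texts_per_category
        self.categories = {}

    def add(self, category, text):
        # deque(maxlen=n) drops the oldest item automatically once n is exceeded,
        # so each category keeps only its most recent texts.
        bucket = self.categories.setdefault(
            category, deque(maxlen=self.texts_per_category)
        )
        bucket.append(text)

    def size(self):
        """Number of texts currently held across all categories."""
        return sum(len(bucket) for bucket in self.categories.values())


corpus = MonitorCorpus(texts_per_category=2)
for text in ["news 1", "news 2", "news 3"]:
    corpus.add("press", text)
corpus.add("fiction", "novel 1")
```

After the loop, the "press" category holds only its two most recent texts. A closed/static corpus corresponds to never calling `add` again once compilation is finished.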

    According to medium:
    o Written corpora: only written texts. E.g.: Brown, LOB.
    o Spoken corpora. E.g.:
        - LLC (London-Lund Corpus): spoken section of SEU; half a million words of British English speech with detailed transcription by means of a prosodic notation showing features such as stress and intonation.
        - SEC (IBM/Lancaster Spoken English Corpus): 50,000 words, in various versions: orthographically transcribed, prosodically transcribed, grammatically tagged, sound-recorded.
        - Canadian Hansard: official record of the proceedings of the Canadian House of Commons; over 60 million words, French and English versions.
        - MARSEC (Machine Readable Spoken English Corpus): each string in the orthographic transcription is linked to the corresponding section in the audio recording.
        - COLT (Bergen Corpus of London Teenage Language): collected in 1993, it consists of the spoken language of 13- to 17-year-old teenagers from different boroughs of London; half a million words, orthographically transcribed and word-class tagged; it is a constituent of the BNC.
    o Mixed corpora: both written and spoken material. E.g.: Birmingham Bank of English, BNC (British National Corpus), ICE (International Corpus of English).

    According to number of languages/dialects represented:
    o Monolingual corpora: texts in one language (or language variety) only. E.g.: all of the above except for the Canadian Hansard.
    o Multilingual or parallel corpora: more than one language/dialect. Parallelism comes in various degrees: from the strictly parallel (original and one or more translated versions of the same texts: Canadian Hansard, English-Norwegian Parallel Corpus; very useful for lexicography, language teaching and translation studies) to the loosely parallel (comparable corpora), i.e. a collection of "similar" texts in different languages or in different varieties of a language. E.g.: ICE (International Corpus of English): texts compiled in 15 countries where English is the first or an official second language, on the basis of exactly the same compilation principles; taken together, Brown (American English), LOB (British English) and Kolhapur (Indian English) could be considered comparable corpora.
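A strictly parallel corpus pairs each stretch of the original with its translation. The sketch below shows one possible in-memory representation. The names `AlignedPair` and `align` are hypothetical, the example sentences are invented (they are not quotations from the Canadian Hansard), and the one-to-one sentence alignment is a simplifying assumption, since real translations often merge or split sentences.

```python
from typing import NamedTuple


class AlignedPair(NamedTuple):
    """One unit of a strictly parallel corpus: an original and its translation."""
    source: str
    target: str


def align(source_sentences, target_sentences):
    """Pair sentences by position, assuming a one-to-one alignment."""
    if len(source_sentences) != len(target_sentences):
        raise ValueError("one-to-one alignment needs equally long sentence lists")
    return [AlignedPair(s, t) for s, t in zip(source_sentences, target_sentences)]


# Invented example sentences, for illustration only.
english = ["The House met at 2 p.m.", "The motion was agreed to."]
french = ["La Chambre se réunit à 14 heures.", "La motion est adoptée."]
pairs = align(english, french)
```

Comparable corpora such as Brown, LOB and Kolhapur would not fit this structure: their texts are similar in design but are not translations of one another, so there is nothing to pair position by position.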


    According to temporal variety:
    o Synchronic: one variety, normally contemporary (at compilation time).
    o Diachronic: several periods of a language. E.g.: Helsinki Corpus.

    According to type of speaker: native vs. learner corpora.

    According to annotation:
    o Plain: e.g. Project Gutenberg texts, produced by scanning; no information about the text (usually not even the edition): not really a corpus but a collection of texts.
    o Annotated:
        - marked up for formatting attributes (e.g. page breaks, paragraphs, font sizes, italics, etc.): Brown.
        - annotated with identifying information (e.g. edition date, author, genre, register, etc.): BNC, ICE-GB.
        - annotated for part of speech, syntactic structure, discourse information, etc.: LOBTAG, BNC, ICE-GB.
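To make the difference between plain and part-of-speech-annotated text concrete, here is a minimal sketch that reads tokens written as `word_TAG`. This underscore notation, the function name `parse_tagged` and the example sentence are assumptions made for illustration; the tagged corpora listed above each define their own markup, which is not reproduced here.

```python
def parse_tagged(line, sep="_"):
    """Split 'word_TAG' tokens into (word, tag) pairs; untagged tokens get tag None."""
    pairs = []
    for token in line.split():
        if sep in token:
            # Split on the last separator so words containing it stay intact.
            word, _, tag = token.rpartition(sep)
            pairs.append((word, tag))
        else:
            pairs.append((token, None))
    return pairs


# Invented example; the tags are only illustrative labels.
tagged = "The_AT corpus_NN is_BEZ annotated_VBN"
pairs = parse_tagged(tagged)

# Stripping the tags recovers the plain version of the text.
plain = " ".join(word for word, _ in pairs)
```

The plain text can always be derived from the annotated one, but not the other way round, which is why annotation adds value to a corpus.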

    For a comprehensive list of corpora and links to them, visit:

    http://www.uow.edu.au/~dlee/CBLLinks.htm

    http://www.ugr.es/~pedrou/
