Using Corpus

Using Corpus for Research 6/15/22


Introduction to corpus



Using Corpus for Research
Friday, May 22, 2015

What is a Corpus?

A corpus is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research (Sinclair 2005: 16).

The Characteristics of the Corpus Approach

It is empirical, analyzing the actual patterns of language use in natural texts.
It utilizes a large and principled collection of natural texts as the basis for analysis.
It makes extensive use of computers for analysis.
It depends on both quantitative and qualitative analytical techniques.

What is Corpus Linguistics?

It is the study of language based on examples of real life language use (McEnery and Wilson 1996: 1).
It is an area which focuses upon a set of procedures, or methods, for studying language (McEnery and Hardie 2012: 1).
Research in corpus linguistics has led to the elaboration of better quality learner input and provided researchers and teachers with a wider, finer perspective into language in use (Campoy-Cubillo, Bellés-Fortuño and Gea-Valor 2010: 3).

Case #1

What is the difference between the use of will and shall in the simple future tense?
Who says I/We shall? (Data from the BNC, the British National Corpus)

Case #2

If you want to state that something is very good, you say:
(1) It's fabulous.
(2) It's great.

Who uses the adjective fabulous?

Who uses the adjective great?

Case #3 Language and Culture

Case Study: How has the word marriage been used in Britain?
Data: The British Corpora in the Bank of English

The British Corpora in the Bank of English

today = Today tabloids
sunnow = The Sun & News of the World tabloids
brbooks = British books (scientific and popular)
times = Times newspapers
brmags = British magazines
guard = Guardian newspapers
indy = Independent newspapers
econ = Economist newspapers
brephem = British ephemeral (ads, booklets, etc.)
bbc = BBC radio
brspok = British spoken language
newsci = New Scientist magazines

The frequency information

What can we infer from the frequency information?

The concordance lines

What can we infer from the concordance lines regarding the image of the word marriage?

The collocation information

What can we infer from the collocations of the word marriage?

Refer to the Collocation Information

Divide the words into the following groups:
Possessives:
Words indicating a sequence or period of time:
Words to do with other relationships:
Words to do with a marriage ending:
Words indicating happiness and success:

The English language analysis

The verb end is an ergative verb.

What can we infer?

Further Analysis on Frequency Data

Referring to the previous Case Study:
Words indicating women (her, she, daughter) are more significantly associated with marriage than those indicating men (even though men and women must get married in equal numbers).
Is it because the frequency of the word woman is higher than that of the word man?
Which one has the higher frequency, the word husband or wife?

Frequency Data

woman    114,022     man        280,290
wife      79,562     husband     52,181
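The two questions above can be checked with simple arithmetic on the four counts. The following is a minimal sketch in Python; the dictionary name `freq` and the helper `ratio` are illustrative names, not part of any corpus tool, and the counts are copied from the frequency data above.

```python
# Raw frequencies copied from the slide (Bank of English, British corpora).
freq = {"woman": 114_022, "man": 280_290, "wife": 79_562, "husband": 52_181}

def ratio(a, b):
    """Return how many times more frequent word `a` is than word `b`."""
    return freq[a] / freq[b]

# Compare each pair in the direction of the more frequent word.
print(f"man : woman    = {ratio('man', 'woman'):.2f}")     # 2.46
print(f"wife : husband = {ratio('wife', 'husband'):.2f}")  # 1.52
```

The ratios only restate the counts in a form that is easier to compare. They show that man is more frequent than woman while wife is more frequent than husband; what that pattern means is left to the interpretation question.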

What is your interpretation?

Case #4 The Indonesian Corpus

What is the difference between the use of the word pria ("man") and the word wanita ("woman")?

Case #4 The Indonesian Corpus

What is the difference between the use of the word wanita and the word perempuan (both meaning "woman")?

Building a Corpus and Analyzing an Offline Corpus

A corpus is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research (Sinclair 2005: 16).
An offline corpus usually uses the .txt format with UTF-8 encoding.

How to Build an Offline Corpus

If the data is in MS Word format, open the file and follow these steps:
(1) Click File.
(2) Choose Save as.
(3) Under Save as type, choose Plain Text.
(4) Click Save.
(5) Click OK.
If the data is in PDF format, use the AntFileConverter software.
If the PDF consists of scanned images, it must first be converted with other software (e.g. Nitro PDF or OmniPage).

Example of a small corpus: a word list analysis on the descriptive paragraphs written by male and female students at a university
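The plain-text, UTF-8 convention described above can also be followed when a corpus is prepared or read by script. The sketch below uses Python with only the standard library; the function names `save_as_corpus_file` and `load_corpus` and the sample file names are hypothetical, chosen for illustration. It does not replace the Word or PDF conversion steps, which still have to produce the text first.

```python
import tempfile
from pathlib import Path

def save_as_corpus_file(text, path):
    """Write `text` to `path` as plain text with UTF-8 encoding."""
    Path(path).write_text(text, encoding="utf-8")

def load_corpus(folder):
    """Read every .txt file in `folder` as UTF-8 and return {filename: text}."""
    corpus = {}
    for path in sorted(Path(folder).glob("*.txt")):
        corpus[path.name] = path.read_text(encoding="utf-8")
    return corpus

# Small demonstration in a temporary folder (the file names are made up).
with tempfile.TemporaryDirectory() as folder:
    save_as_corpus_file("My room is small but bright.", Path(folder) / "student_01.txt")
    save_as_corpus_file("The library is quiet and cool.", Path(folder) / "student_02.txt")
    demo = load_corpus(folder)
    print(sorted(demo))  # ['student_01.txt', 'student_02.txt']
```

Passing `encoding="utf-8"` explicitly on both write and read avoids depending on the operating system's default encoding, which matters for texts containing non-ASCII characters.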

Using AntConc

AntConc is a freeware, multiplatform tool for carrying out corpus linguistics research and data-driven learning. It was created by Laurence Anthony, Waseda University, Japan.
Website: http://www.antlab.sci.waseda.ac.jp/antconc_index.html

AntConc contains seven tools:

Concordance Tool: This tool shows search results in a 'KWIC' (KeyWord In Context) format. This allows you to see how words and phrases are commonly used in a corpus of texts.
Concordance Plot Tool: This tool shows search results plotted in a 'barcode' format. This allows you to see the position where search results appear in target texts.
File View Tool: This tool shows the text of individual files. This allows you to investigate in more detail the results generated in other tools of AntConc.
Clusters (N-Grams): The N-Grams Tool scans the entire corpus for 'N' (e.g. 1 word, 2 words, ...) length clusters. This allows you to find common expressions.
Collocates: This tool shows the collocates of a search term.
Word List: This tool counts all the words in the corpus and presents them in an ordered list. This allows you to quickly find which words are the most frequent.
Keyword List: This tool shows which words are unusually frequent (or infrequent) in the corpus in comparison with the words in a reference corpus.

Word Types and Word Tokens
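To make the distinction between word types and word tokens concrete, and to show roughly what a word list, a KWIC concordance, and an n-gram count compute, here is a minimal Python sketch. It is not AntConc's implementation: the tokenization rule, the function names, and the sample sentence are all assumptions made for illustration, and AntConc's own token definition and settings may give different counts.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

def word_list(tokens):
    """Return (word, frequency) pairs, most frequent first."""
    return Counter(tokens).most_common()

def kwic(tokens, node, width=3):
    """Return KWIC lines with up to `width` tokens of context on each side of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

def ngrams(tokens, n=2):
    """Count every contiguous run of n tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Made-up sample text, used only to exercise the functions.
text = "The marriage ended. Her marriage was happy, and the marriage lasted."
tokens = tokenize(text)

print("tokens:", len(tokens))        # 11 running words
print("types:", len(set(tokens)))    # 8 distinct words
print(word_list(tokens)[:2])         # [('marriage', 3), ('the', 2)]
for line in kwic(tokens, "marriage", width=2):
    print(line)
print(ngrams(tokens, 2).most_common(1))  # [(('the', 'marriage'), 2)]
```

In the sample, the word marriage contributes three tokens but only one type, which is why the token count (11) is higher than the type count (8).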

Using AWP

AntWordProfiler contains two tools for carrying out corpus linguistics research on vocabulary profiling.

Vocabulary Profile Tool: This tool allows you to generate vocabulary statistics and frequency information about a corpus of texts loaded into the program.
File Viewer and Editor Tool: This tool allows you to view an individual user file and highlight the different levels of vocabulary in the file using colour coding. It also shows the overall coverage of different vocabulary levels.

Thank you