Introduction : corpora, corpus use, and the British National Corpus

Dr. Ylva Berglund Prytz [email protected] http://www.natcorp.ox.ac.uk/


Page 1: Introduction : corpora, corpus use, and the  British National Corpus

Introduction : corpora, corpus use,

and the British National Corpus

Dr. Ylva Berglund
[email protected]
http://www.natcorp.ox.ac.uk/

Page 2: Introduction : corpora, corpus use, and the  British National Corpus

Outline
• Presentation: Corpora, corpus use, and the BNC
• Demonstration: How to use BNC with Xaira
• Hands-on: BNC with Xaira
• Presentation: Using the BNC for teaching and research
• More hands-on: exploring more
• Questions and answers

Page 3: Introduction : corpora, corpus use, and the  British National Corpus

At the end of today you should
• have a basic working knowledge about
  • corpora and corpus use
  • the BNC
  • Xaira
• feel confident using Xaira
• be able to explore the area on your own
• know where to turn for help and advice

Page 4: Introduction : corpora, corpus use, and the  British National Corpus

Approaches to linguistic study

Intuition
• “Feel” what is right/wrong/possible
• One person’s language
• Subjective

Study of usage
• Examine what is actually said/written
• Several people
• Objective

Page 5: Introduction : corpora, corpus use, and the  British National Corpus

How do you study usage?

• Examine naturally occurring language
• Draw conclusions

Need a sample of language, produced by different people in various contexts

Find a corpus!

Page 6: Introduction : corpora, corpus use, and the  British National Corpus

What is a corpus?

A collection of naturally occurring language data compiled to mirror a language/language variety

• (Usually) computer-readable
• (Usually) contains more than text (annotation, meta-data)

Page 7: Introduction : corpora, corpus use, and the  British National Corpus

What is a corpus? – some definitions

A corpus can be defined as a collection of texts assumed to be representative of a given language. (Tognini-Bonelli 2001: 2)

A corpus is a collection of naturally-occurring language text, chosen to characterise a state or variety of language. (Sinclair 1991: 171)

All the material included in a corpus, whether spoken, written […] is assumed to be taken from genuine communications of people going about their normal business. (ibid: 55)

Page 8: Introduction : corpora, corpus use, and the  British National Corpus

How can a corpus help?
• Look for patterns to see regularities
• Quantify
• See several examples
• Real language: language in use
• Based on a variety of sources

Page 9: Introduction : corpora, corpus use, and the  British National Corpus

Types of corpora
• Balanced corpora (= Reference or general corpora)
• Specialised corpora
  • Genre-specific, LSP (e.g. English for Academic Purposes) …
  • Varieties (dialectal, social, historical)
  • Learner language, English as a Lingua Franca
• Multilingual corpora
  • Parallel corpora (translations; alignable)
  • Comparable corpora (similar texts)
• Fixed size / monitor corpora
• Mode and medium
  • Written, spoken and transcribed, spoken with audio, video

Page 10: Introduction : corpora, corpus use, and the  British National Corpus

Famous corpora
• Brown family (Brown, LOB, FLOB)
  • 1 million words, different text categories
• Bank of English
  • Monitor corpus, grows with time
• International Corpus of English (ICE)
  • Different national varieties of English. 1 million words, written and spoken
• British National Corpus
  • Reference corpus, fixed, 100 million words, written and spoken

Page 11: Introduction : corpora, corpus use, and the  British National Corpus

British National Corpus (BNC)

Page 12: Introduction : corpora, corpus use, and the  British National Corpus

What is the BNC?

A snapshot of British English, taken at the end of the 20th century

100 million words in approx 4,000 different text samples, both spoken (10%) and written (90%)

Synchronic (1960-93), sampled, general purpose corpus

Available under licence; latest edition is BNC XML edition (March 2007)

Page 13: Introduction : corpora, corpus use, and the  British National Corpus

More than text

• Metadata
  • About text, author/speaker, audience
• Structural & typographical information
  • Paragraph, sentence, heading, list, bolds
• Extra-linguistic information
  • Voice quality, noise, pauses, overlap
• Linguistic information
  • Part-of-speech

Page 14: Introduction : corpora, corpus use, and the  British National Corpus

Who produced the BNC and why?

• a consortium of dictionary publishers and academic researchers
  • OUP, Longman, Chambers
  • OUCS, UCREL, BL R&D
• with funding from DTI/SERC under JFIT, 1990-1994
• Lexicographers, NLP researchers ... but not language teachers!

Page 15: Introduction : corpora, corpus use, and the  British National Corpus

Stated Project Goals
• A synchronic (1990-4) corpus of samples
• both spoken and written
• from the full range of British English language production
• of non-opportunistic design, for generic applicability
• with word class annotation and contextual information

Page 16: Introduction : corpora, corpus use, and the  British National Corpus

Actual (?) project goals
• Better ELT dictionaries
  • authoritative
  • both speech and writing
• A model for European corpus work
  • design and encoding
• Industrial-academic co-operation
• A REALLY BIG corpus

Page 17: Introduction : corpora, corpus use, and the  British National Corpus

Production of the BNC
• took three years (at least)
• cost GBP 1.6 million (at least)
• came about through an unusual coincidence of interests amongst:
  • Lexicographical publishers
  • Government (DTI)
  • Science and Engineering Research Council

Page 18: Introduction : corpora, corpus use, and the  British National Corpus

Project consequences

• industrial-scale text production system
• necessary compromises?
• technically over-ambitious?
• IPR and profitability

The BNC looks back to Brown and LOB in its design and markup, and forward to the Web in its scope and indeterminacy

Page 19: Introduction : corpora, corpus use, and the  British National Corpus

How was the corpus created?

Page 20: Introduction : corpora, corpus use, and the  British National Corpus

How was the corpus created?
1. Corpus design
2. Text selection
3. Clearance
4. Capture
5. Add additional information
6. Merge
7. (documentation)
8. Distribution

Page 21: Introduction : corpora, corpus use, and the  British National Corpus

The BNC “sausage machine”

[Diagram. Boxes, in the order shown: OUP; Written (OUP/Chambers); Spoken (Longman); Initial CDIF Conversion and Validation (OUCS); Word Class Annotation (UCREL); Header generation and final validation (OUCS)]

Stages:
• Selection, clearance, and capture
• Enrichment and encoding
• Documentation, distribution, maintenance

Page 22: Introduction : corpora, corpus use, and the  British National Corpus

Text selection
1. Design criteria
  • Types of texts
  • Sources
  • Number of samples
  • Size of samples
2. Descriptive criteria
  • Additional information where available

Page 23: Introduction : corpora, corpus use, and the  British National Corpus

Selection criteria: written texts

Domain
• imaginative (c. 25%)
• informative

Medium
• Book, periodicals, misc. published, unpublished, written to be spoken

Time
• 1985-1993 (1960-75, 1975-84)

Page 24: Introduction : corpora, corpus use, and the  British National Corpus

“Descriptive” criteria: written texts
• Sample size (number of words) and extent (start and end points)
• Topic or subject of the text
• Author's name, age, gender, region of origin, and domicile
• Target age group and gender
• "Level" of writing (reading difficulty): the more literary or technical a text, the "higher" its level

Page 25: Introduction : corpora, corpus use, and the  British National Corpus

Selection criteria: spoken texts

• demographic (spoken conversation)
  • transcriptions of spontaneous natural conversations made by recruited volunteers
  • original recordings are available from the British Library
• context-governed (other spoken material)
  • transcriptions of recordings made at specific types of meeting and event

Page 26: Introduction : corpora, corpus use, and the  British National Corpus

Spoken texts: context-governed

Four broad categories of social context:
• Educational and informative events, such as lectures, news broadcasts, classroom discussion, tutorials
• Business events, such as sales demonstrations, trades union meetings, consultations, interviews
• Institutional and public events, such as sermons, political speeches, council meetings
• Leisure events, such as sports commentaries, after-dinner speeches, club meetings, radio phone-ins

Page 27: Introduction : corpora, corpus use, and the  British National Corpus

Descriptive criteria: spoken texts
• Features relating to the speaker (age, sex, social class, dialect)
• Context of recording (place, time)
• Features of the recording (non-verbal events, paralinguistic phenomena, unclear instances)

Included when known
Sometimes provided by respondent

Page 28: Introduction : corpora, corpus use, and the  British National Corpus

What is the result?

Page 29: Introduction : corpora, corpus use, and the  British National Corpus

What is the BNC?
• 4,000+ texts
• Ca. 100,000,000 words
• 10% spoken
• Information about
  • the texts
  • the speakers/writers
  • the words
• Delivered with a search tool: XAIRA

Page 30: Introduction : corpora, corpus use, and the  British National Corpus

What's in the BNC?

[Pie chart: number of words per text category]
• Books and Periodicals: 79,238,146
• Other written: 8,715,786
• Spoken Context Governed: 6,175,896
• Spoken Demographic: 4,233,955
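To make the proportions behind this chart concrete, the short sketch below turns the four word counts into percentages. The pairing of each figure with its legend label is an assumption here (the extracted slide lists numbers and labels separately), and the names `COUNTS` and `shares` are purely illustrative.

```python
# Word counts per category. Which figure belongs to which label is an
# assumption: the slide text gives the numbers and the legend separately.
COUNTS = {
    "Books and Periodicals": 79_238_146,
    "Other written": 8_715_786,
    "Spoken Context Governed": 6_175_896,
    "Spoken Demographic": 4_233_955,
}


def shares(counts):
    """Return each category's share of the total, in percent."""
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}


if __name__ == "__main__":
    for label, pct in shares(COUNTS).items():
        print(f"{label:25s} {pct:5.1f}%")
```

Under that pairing, the two spoken categories together come to roughly a tenth of the total, which is in line with the "10% spoken" figure quoted elsewhere in the slides.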

Page 31: Introduction : corpora, corpus use, and the  British National Corpus

What topics?

[Pie chart: number of words per topic domain]
• World Affairs: 17,244,534
• Commerce: 7,341,163
• Arts: 6,574,857
• Belief: 3,037,533
• Leisure: 12,237,834
• Imaginative: 16,496,420
• Scientific: 3,821,902
• Social Science: 14,025,537
• Applied Science: 7,174,152

Page 32: Introduction : corpora, corpus use, and the  British National Corpus

Post-hoc text-type classification

[Two charts: share of sentences and share of words per text type. Text types: Academic, Literary, Press, Nonfiction, Unpublished, Conversation, Other spoken]

Page 33: Introduction : corpora, corpus use, and the  British National Corpus

Format
• Corpus header (1)
• Corpus texts (4,000+)
  • Text header
  • Text

<corpus>
  <corpusHeader></corpusHeader>
  <corpusText>
    <textHeader></textHeader>
    <text></text>
  </corpusText>
  <corpusText>
    <textHeader></textHeader>
    <text></text>
  </corpusText>
  …
</corpus>
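As a rough illustration of why this header-plus-text layout is convenient, the sketch below walks a tiny document built on the same pattern and pairs each text with its header. The element names are the simplified ones used on the slide, which may differ from those in the distributed corpus files; the sample content and the function name `list_texts` are invented for the example.

```python
import xml.etree.ElementTree as ET

# A toy corpus following the slide's schematic layout. Element names are the
# slide's simplified ones; the header and text contents are invented.
SKELETON = """
<corpus>
  <corpusHeader>about the whole corpus</corpusHeader>
  <corpusText>
    <textHeader>about text 1</textHeader>
    <text>first text</text>
  </corpusText>
  <corpusText>
    <textHeader>about text 2</textHeader>
    <text>second text</text>
  </corpusText>
</corpus>
"""


def list_texts(xml_string):
    """Return (header, text) pairs, one per <corpusText> element."""
    root = ET.fromstring(xml_string.strip())
    return [
        (ct.findtext("textHeader"), ct.findtext("text"))
        for ct in root.findall("corpusText")
    ]


if __name__ == "__main__":
    for header, text in list_texts(SKELETON):
        print(header, "->", text)
```

Because every text carries its own header, information about a text can be looked up at the point where the text is read.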

Page 34: Introduction : corpora, corpus use, and the  British National Corpus

Annotation, encoding, markup
A means of making explicit, and thus processable:
• structure: texts, sections, paragraphs, turns, sentences, words...
• metadata: text-type, situational parameters, context
• analysis: morphology, syntactic function, translation

Page 35: Introduction : corpora, corpus use, and the  British National Corpus

Word class annotation
• CLAWS (Leech, Garside et al) approach
• What counts as a word? This isn't prima facie obvious, in spite of spelling conventions.
• In BNC-XML, each word is explicitly marked and annotated with
  • a root form or lemma
  • an automatically assigned C5 word class code
  • a simplified POS code

Page 36: Introduction : corpora, corpus use, and the  British National Corpus

Example: word class annotation

<s n="11"><w c5="NN1" hw="difficulty" pos="SUBST">Difficulty </w><w c5="VBZ" hw="be" pos="VERB">is </w><w c5="VBG" hw="be" pos="VERB">being </w><w c5="VVN" hw="express" pos="VERB">expressed </w><w c5="PRP" hw="with" pos="PREP">with </w><w c5="AT0" hw="the" pos="ART">the </w><w c5="NN1" hw="method" pos="SUBST">method </w><w c5="TO0" hw="to" pos="PREP">to </w><w c5="VBI" hw="be" pos="VERB">be </w><w c5="VVN" hw="use" pos="VERB">used </w><w c5="TO0" hw="to" pos="PREP">to </w><w c5="VVI" hw="launch" pos="VERB">launch </w><w c5="AT0" hw="the" pos="ART">the </w><w c5="NN1" hw="scheme" pos="SUBST">scheme</w><c c5="PUN">.</c></s>

Page 37: Introduction : corpora, corpus use, and the  British National Corpus

<s n="11"><w c5="NN1" hw="difficulty" pos="SUBST">Difficulty </w><w c5="VBZ" hw="be" pos="VERB">is </w><w c5="VBG" hw="be" pos="VERB">being </w><w c5="VVN" hw="express" pos="VERB">expressed </w><w c5="PRP" hw="with" pos="PREP">with </w><w c5="AT0" hw="the" pos="ART">the </w><w c5="NN1" hw="method" pos="SUBST">method </w><w c5="TO0" hw="to" pos="PREP">to </w><w c5="VBI" hw="be" pos="VERB">be </w><w c5="VVN" hw="use" pos="VERB">used </w><w c5="TO0" hw="to" pos="PREP">to </w><w c5="VVI" hw="launch" pos="VERB">launch </w><w c5="AT0" hw="the" pos="ART">the </w><w c5="NN1" hw="scheme" pos="SUBST">scheme</w><c c5="PUN">.</c>

</s>

c5 = detailed part-of-speech
hw = head word (new)
pos = simple part-of-speech (new)
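The three attributes can be read with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a shortened copy of the example sentence (the first four words plus the full stop); the function name `read_words` is just an illustrative choice, and this is not how Xaira itself processes the corpus.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example sentence: first four words and the full stop.
SENTENCE = (
    '<s n="11">'
    '<w c5="NN1" hw="difficulty" pos="SUBST">Difficulty </w>'
    '<w c5="VBZ" hw="be" pos="VERB">is </w>'
    '<w c5="VBG" hw="be" pos="VERB">being </w>'
    '<w c5="VVN" hw="express" pos="VERB">expressed </w>'
    '<c c5="PUN">.</c>'
    '</s>'
)


def read_words(sentence_xml):
    """Return (form, c5, hw, pos) for every <w> element in the sentence."""
    s = ET.fromstring(sentence_xml)
    return [
        (w.text.strip(), w.get("c5"), w.get("hw"), w.get("pos"))
        for w in s.iter("w")
    ]


if __name__ == "__main__":
    for form, c5, hw, pos in read_words(SENTENCE):
        print(f"{form:12s} c5={c5:4s} hw={hw:12s} pos={pos}")
```

Note how "is" and "being" share the head word "be" while carrying different C5 codes, which is what makes lemma-based searches possible.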

Page 38: Introduction : corpora, corpus use, and the  British National Corpus

Some BNC-XML elements

• <wtext> or <stext>
• <div> = section
• <p> = paragraph or <u> = utterance
• <s> = “sentence”
• <w> = word and <c> = punctuation
• <mw> = multiword unit
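To see how these elements nest, the sketch below counts each element type in a one-sentence written text. The sample is invented for illustration (it is not taken from the corpus), and it assumes that a multiword unit wraps the individual `<w>` elements it consists of.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented one-sentence sample. It assumes <mw> wraps the <w> elements that
# make up the multiword unit ("in spite of").
SAMPLE = (
    '<wtext><div><p><s n="1">'
    '<mw c5="PRP">'
    '<w c5="PRP" hw="in" pos="PREP">in </w>'
    '<w c5="NN1" hw="spite" pos="SUBST">spite </w>'
    '<w c5="PRF" hw="of" pos="PREP">of </w>'
    '</mw>'
    '<w c5="NN2" hw="convention" pos="SUBST">conventions</w>'
    '<c c5="PUN">.</c>'
    '</s></p></div></wtext>'
)


def count_elements(xml_string):
    """Count how often each element name occurs in the document."""
    root = ET.fromstring(xml_string)
    return Counter(el.tag for el in root.iter())


if __name__ == "__main__":
    for tag, n in sorted(count_elements(SAMPLE).items()):
        print(f"<{tag}>: {n}")
```

Whether "in spite of" counts as one word or three depends on whether `<mw>` or `<w>` elements are counted, which is the "what counts as a word?" question in practice.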

Page 39: Introduction : corpora, corpus use, and the  British National Corpus

What is the markup for?

It makes it possible for you to
• distinguish aids=SUBST from aids=VERB
• distinguish occurrences in writing from ones in speech
• distinguish occurrences in headings from ones in paragraphs
• identify contextual units like sentences and paragraphs

FACTSHEET WHAT IS AIDS?
AIDS (Acquired Immune Deficiency Syndrome) is a condition caused by a virus called HIV (Human Immuno Deficiency Virus).
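The first point can be sketched in a few lines: search for a word form and split the hits by their simplified POS code. The two sample sentences and their tag values are invented for illustration, not taken from the corpus, and `count_by_pos` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Two invented sentences; the tag and head-word values are illustrative
# guesses, not corpus data.
SAMPLE = (
    '<p>'
    '<s n="1">'
    '<w c5="NN1" hw="aids" pos="SUBST">AIDS </w>'
    '<w c5="VBZ" hw="be" pos="VERB">is </w>'
    '<w c5="AT0" hw="a" pos="ART">a </w>'
    '<w c5="NN1" hw="condition" pos="SUBST">condition</w>'
    '<c c5="PUN">.</c>'
    '</s>'
    '<s n="2">'
    '<w c5="NN1" hw="exercise" pos="SUBST">Exercise </w>'
    '<w c5="VVZ" hw="aid" pos="VERB">aids </w>'
    '<w c5="NN1" hw="recovery" pos="SUBST">recovery</w>'
    '<c c5="PUN">.</c>'
    '</s>'
    '</p>'
)


def count_by_pos(xml_string, form):
    """Count occurrences of a word form (ignoring case), split by pos code."""
    root = ET.fromstring(xml_string)
    counts = Counter()
    for w in root.iter("w"):
        if (w.text or "").strip().lower() == form.lower():
            counts[w.get("pos")] += 1
    return counts


if __name__ == "__main__":
    print(count_by_pos(SAMPLE, "aids"))
```

A plain text search for "aids" would return both hits without telling them apart; the `pos` attribute is what separates the noun from the verb.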

Page 40: Introduction : corpora, corpus use, and the  British National Corpus

Who uses the BNC (and how?)
• Linguists
  • Research on (English) language
• Teachers
  • Reference, generate teaching materials, in classroom
• Publishers
  • Dictionaries, EFL text books
• Language engineers
  • Language + computer tools, AI, NLP
• Students/language learners
• Computer scientists
  • Information retrieval
• Psychologists/neurologists
  • General ‘norm’ or reference

Lexicographers, NLP researchers

Page 41: Introduction : corpora, corpus use, and the  British National Corpus

What makes the BNC so special?
• Size
• Design
• General availability
• Standardized markup system
• Structural annotation
• Word class annotation
• Contextual information
• Model for other projects

...in these respects, the BNC remains distinctive, twenty years on!

Page 42: Introduction : corpora, corpus use, and the  British National Corpus

How to use the BNC (with Xaira)

Page 43: Introduction : corpora, corpus use, and the  British National Corpus

The BNC can be used in different ways and with different tools

User needs to know
• What information is available
• Where/how is information coded

XAIRA can help

Page 44: Introduction : corpora, corpus use, and the  British National Corpus

Search for
• Words or phrases
• Word class information
• Annotation/mark-up

or a combination of them

Page 45: Introduction : corpora, corpus use, and the  British National Corpus

Display
• Search term with context, with or without mark-up
• Information about text
• Collocations (co-occurring words)
• Distribution across parts of the corpus

and much more
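For readers who have not seen these displays before, the sketch below shows the idea behind two of them: a key-word-in-context (KWIC) listing and a simple collocate count within a fixed window. It is a plain Python illustration over an invented token list, not a description of how Xaira works internally; the function names and the window size are assumptions.

```python
from collections import Counter


def kwic(tokens, node, width=3):
    """Return (left context, hit, right context) for each match of node."""
    node = node.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def collocates(tokens, node, width=3):
    """Count the words occurring within `width` tokens of each match of node."""
    node = node.lower()
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            window = tokens[max(0, i - width):i] + tokens[i + 1:i + 1 + width]
            counts.update(w.lower() for w in window)
    return counts


# Invented token list, loosely based on the example sentence in the slides.
TOKENS = "the method to be used to launch the scheme and the method to fund it".split()

if __name__ == "__main__":
    for left, hit, right in kwic(TOKENS, "method", width=2):
        print(f"{left:>10s} [{hit}] {right}")
    print(collocates(TOKENS, "method", width=2).most_common(3))
```

Raw co-occurrence counts like these favour frequent words such as "the"; corpus tools usually offer statistical measures that correct for this.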

Page 46: Introduction : corpora, corpus use, and the  British National Corpus

XAIRA – XML-aware retrieval application
• Searches an index of the corpus
• Uses information in the headers and the texts
• Often more than one way to make a search
• Can be used with other corpora (if they are indexed first)
