Corpus Linguistics - NCCU

Corpus Linguistics: Statistical tools for analyzing language

Simon Smith

Outline

• What’s a (linguistic) corpus?

– What features?

• What can corpora be used for?

– What applications?

• What kinds of corpora are there?

– How representative?

– How big?

• How can we look inside corpora?

– Practice with some simple query tools

(Short) Quiz

• What does the Latin word corpus mean?

• What are corpora?

• What is linguistics?

• What’s the difference between

– Applied linguistics

– Sociolinguistics

– Psycholinguistics

• and

– Corpus linguistics?

Simple definition

• A database of language

– Features of a database?

• A more formal definition (McEnery, Xiao & Tono 2005)

– A collection of sampled texts, written or spoken, in machine-readable form, which may be annotated in various ways.

5 major uses for linguistic corpora

• Language learning and teaching

• Theoretical research on language and linguistics

– Including comparative studies

• Literary research and analysis

• Language technology

• Lexicography (= dictionary making)

– Cobuild, Longman, …

– All learner dictionaries now use corpora

How do you make a dictionary? (what resources…?)

• Use your own intuitions

• Ask all your friends for their intuitions

• Consult other dictionaries

• Read thousands of books

– and take lots of notes

• Use a corpus

Corpus linguistics as a tool

• Or, an approach

• What is the alternative?

• Chomsky (1962) said

Why is this no longer really true?

Why do corpora keep getting bigger? (anyone?)

• Because they can

– Price of storage

– Speed of access

• Representativeness

– A small corpus has many examples of common words, maybe

– but…… ?

Lexical distribution

• What’s the most common word in English?

• What % does it make up of a whole corpus?

• The 100 most common words make up __% of all the words in a corpus?

• The 7500 most common words make up __%

• Answers:

– the; 45% (top 100 words); 90% (top 7500 words)

• So:

– you need massive corpora if you want to really represent rare words properly
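The coverage figures above can be checked on any tokenized text. The following is a minimal sketch, assuming a whitespace tokenizer and a toy sentence standing in for a corpus; the function name `coverage` is chosen here for illustration and does not come from any of the tools in these slides.

```python
from collections import Counter

def coverage(tokens, top_n):
    """Share of all tokens accounted for by the top_n most frequent word types."""
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(top_n))
    return covered / len(tokens)

# Toy text standing in for a corpus (an assumption for illustration only).
text = "the cat sat on the mat and the dog sat on the rug"
tokens = text.split()

print(Counter(tokens).most_common(1))     # the single most frequent word type
print(round(coverage(tokens, 1), 2))      # share of tokens covered by that word
print(round(coverage(tokens, 3), 2))      # share covered by the top three types
```

On a real corpus the same calculation would be run with `top_n` set to 100 or 7500; the many word types outside those lists each occur so rarely that only a very large corpus gives enough examples of them.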

Three ages of corpus research (in lexicography)

Kilgarriff, Lexical Computing (Taiwan, Dec 2006)

Pre-computer

KWIC concordance (KWIC=?)

Collocational tools (what’s a collocation?)

Word Sketch (using Sketch Engine)

Age 1: Pre-computer

First Oxford English Dictionary (1860):

• 20 million index cards

– a word (usually rare) and a citation

Age 2: KWIC Concordance

1 arity, which will be used to take a party of under-privileged children to D

2 from outside. You are invited to a party and after a couple of drinks you d

3 tion, we believe politicians of all parties will listen to our views. &equo

4 ould be reaching agreement with all parties concerned, as to which events,

5 lack people. I have certainly been party to one or two discussions amongst

6 . These should be discussed by both parties before entering into the relatio

7 presents They had hosted a cocktail party at Kensington palace, for example

8 akes. By midnight the end-of-course party is in full swing, but most cadet

9 e should be a right for the injured party to terminate the contract. A mana

10 by the Safran Peoples ' Liberation Party. This presents the powerful neigh

11 s. Ahead I could see the rest of my party plodding towards the final slope t

12 cial ethic. The two main political parties - the Tories and the Liberals -

13 ritish successes in Perth The small party of British players competing in th

14 to help control. One member of the party went to summon the rescue team and

15 rket society fashion magazine. The party was held at his flat which was a l

16 security and secrecy than any Tory Party Conference : it seems that bootleg

Age 2: KWIC Concordances

From 1980

Computerised

COBUILD project was innovator

the coloured-pens method

1 political association

2 social event

3 group of people

4 person in an agreement/dispute

5 to be party to something...

The coloured pens method
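A KWIC (key word in context) display of the kind shown for party can be produced with a few lines of code. This is a minimal sketch, assuming whitespace tokenization and exact matching on the word form; the function name `kwic` and the `width` parameter are illustrative, and the sample text is pieced together from two of the concordance lines.

```python
def kwic(tokens, node, width=4):
    """Return one (left, keyword, right) triple per occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

text = ("You are invited to a party and after a couple of drinks "
        "a right for the injured party to terminate the contract")

# Right-align the left context so the keyword lines up in one column.
for left, kw, right in kwic(text.split(), "party"):
    print(f"{left:>25}  {kw}  {right}")
```

Because the match is on the exact form, parties would not be found; a real concordancer would typically search on a lemma instead. Assigning each line to a sense, as in the coloured pens method, remains a manual step in this sketch.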

Limitation of KWIC analysis

• As corpora get bigger: too much data

– 50 lines for a word: read all

– 500 lines: could read all, takes a long time

– 5000 lines: no

• Instead, create statistical summaries of word usage
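One simple statistical summary is a list of the words that occur near the search word, ranked by an association score. The sketch below assumes a fixed window of tokens and uses pointwise mutual information in its simplest textbook form; this is an illustrative choice, not necessarily the statistic used by any of the tools named in these slides, and the token list is a toy example.

```python
import math
from collections import Counter

def collocates(tokens, node, window=2):
    """Count words occurring within `window` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts

def pmi(pair_count, node_count, coll_count, n_tokens):
    """Pointwise mutual information: log2 of observed over expected co-occurrence."""
    return math.log2(pair_count * n_tokens / (node_count * coll_count))

# Toy token list (an assumption for illustration only).
tokens = ("a cocktail party at the palace and a birthday party "
          "at the flat and a cocktail party").split()
freq = Counter(tokens)
colls = collocates(tokens, "party", window=1)

for word, pair in colls.most_common():
    score = pmi(pair, freq["party"], freq[word], len(tokens))
    print(word, pair, round(score, 2))
```

Instead of reading 5000 concordance lines, the lexicographer reads a short ranked list; on a corpus this small the scores themselves mean little.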

Corpus query tools

• WordSmith

– http://lexically.net

• Xaira

– http://xaira.org

• Sketch Engine

– www.sketchengine.co.uk

Functions of SkE

• KWIC concordance

– Sorting, filtering etc

• Word sketch

• Automatic thesaurus

• Sketch difference

– discriminate near-synonyms

Types of corpus

• Balanced corpora

– cf Sublanguage corpora

• Multilingual corpora

– Parallel corpora

– cf Formosan

• Spoken corpora

• Web corpora

• Learner corpora

– cf Reference corpora

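14 January 2009, Training Course for Korean University Teachers of Chinese (韓國大學華語教師研修課程)

University of Leeds (UK) Internet corpus

• Sharoff (2005)

• 280 million words

• The interface displays co-occurrence statistics for collocates

• Query results are limited to immediately adjacent collocates

– ☑ 開汽車 (“drive a car”)

– X 開過汽車 (“have driven a car”; 過 separates the verb from its object)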
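The adjacency restriction can be illustrated with a small check on segmented text. This is a minimal sketch, assuming the sentences have already been segmented into words; the function `cooccurs`, its `max_gap` parameter and the two example sentences are hypothetical and do not describe how the Leeds interface is implemented.

```python
def cooccurs(tokens, first, second, max_gap=0):
    """True if `second` follows `first` with at most `max_gap` tokens in between."""
    for i, tok in enumerate(tokens):
        if tok == first and second in tokens[i + 1:i + 2 + max_gap]:
            return True
    return False

adjacent = ["我", "開", "汽車"]          # verb and object side by side
separated = ["我", "開", "過", "汽車"]   # aspect marker between verb and object

print(cooccurs(adjacent, "開", "汽車"))              # adjacent pair is found
print(cooccurs(separated, "開", "汽車"))             # missed when only neighbours count
print(cooccurs(separated, "開", "汽車", max_gap=1))  # found once one gap is allowed
```

Allowing a gap recovers the separated pair, at the cost of admitting more accidental co-occurrences.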

Web corpora

• Several languages available on Sketch Engine

• Virtually no limit on size

• Must be “cleaned.” Meaning?

– Lists of names and numbers

– Form data

– HTML

– Pornography
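Part of this cleaning can be sketched in a few lines. The example below is a rough illustration, assuming that running text can be told apart from lists and form labels by line length and by the proportion of digits; the thresholds are arbitrary, the regular expression is a crude way to strip tags, and content filtering (for example of pornography) is not attempted.

```python
import re
from html import unescape

TAG = re.compile(r"<[^>]+>")

def clean_page(html, min_words=5, max_digit_ratio=0.3):
    """Strip markup, then keep only lines that look like running text."""
    text = unescape(TAG.sub(" ", html))
    kept = []
    for line in text.splitlines():
        words = line.split()
        if len(words) < min_words:
            continue  # menus, form labels, short list entries
        digits = sum(ch.isdigit() for ch in line)
        if digits / len(line) > max_digit_ratio:
            continue  # phone lists, tables of numbers
        kept.append(" ".join(words))
    return kept

# Hypothetical page fragment, for illustration only.
page = """<html><body>
<p>Corpora keep getting bigger because storage is cheap.</p>
<li>Smith 02 1234 5678</li>
<li>0912 345 678 0922 111 222 0933 444 555</li>
<form>Name: <input name="n"></form>
</body></html>"""

print(clean_page(page))
```

Only the sentence in the paragraph element survives; the name-and-number entries and the form label are dropped.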

Call types in BT corpus

Primary move type: Question. Ask from top to bottom. Stop when you can answer 'yes'.

Prob: Is there only a description of a problem or situation?

Who: Is it a request about who to contact (e.g. which BT contact point or number to call)?

Info: Is it a request for information or advice (e.g. about BT services, number or account information, the state of the network, general knowledge, or time)?

Connect: Is it a request to be connected to another agent, service, person or organisation?

Action: Is it a request for operator action (e.g. named service; change to BT records or customer service options; initiation of a BT process such as line test; report a fault)?

Other: Everything else
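The top-to-bottom procedure is an ordered decision list: the first question answered 'yes' fixes the label. A minimal sketch follows, assuming the yes/no judgments are supplied by a human annotator; the dictionary keys are hypothetical names invented for this example.

```python
# Ordered as in the table: the first question answered 'yes' decides the label.
QUESTIONS = [
    ("Prob", "only_problem_description"),
    ("Who", "asks_who_to_contact"),
    ("Info", "asks_for_information"),
    ("Connect", "asks_to_be_connected"),
    ("Action", "asks_for_operator_action"),
]

def primary_move_type(answers):
    """`answers` maps question keys to True/False judgments made by an annotator."""
    for label, key in QUESTIONS:
        if answers.get(key, False):
            return label
    return "Other"

# A call that both asks for information and asks to be connected is labelled
# Info, because that question comes first in the list.
print(primary_move_type({"asks_for_information": True, "asks_to_be_connected": True}))
print(primary_move_type({}))
```

The ordering matters: a call satisfying several questions always receives the label of the earliest one.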

Annotation (=tagging)

• East Asian corpora must be segmented (斷詞)

• Annotation should be XML compliant

– But isn’t always

• No annotation:

– Can search for patterns

– Still useful
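Even without annotation, a regular expression can find many patterns in raw text. The sketch below searches for a form of be followed by party to; the list of verb forms is written out by hand because no lemma or part-of-speech information is available, and the second and third sample sentences are invented for illustration.

```python
import re

text = ("I have certainly been party to one or two discussions. "
        "She was party to the agreement. The party was held at his flat.")

# Forms of "be" followed by "party to": a pattern search needing no tags.
pattern = re.compile(
    r"\b(?:am|is|are|was|were|be|been|being)\s+party\s+to\b",
    re.IGNORECASE,
)

matches = pattern.findall(text)
print(matches)
```

With part-of-speech or lemma annotation the same query could be stated once, over the lemma be, instead of listing every form.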

What to annotate?

• The most basic labels?

• POS

• Anything else?

• Genre, authorship, register…

• In a learner corpus?

• Error code, student level, mother tongue

• In my PhD spoken corpus

• Move type
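To make these labels concrete, here is a minimal sketch of an XML-annotated learner sentence and how it might be read. The element and attribute names (`text`, `s`, `w`, `pos`, `err`, `level`, `l1`) are invented for this example and do not follow any particular annotation standard; the sentence and its error code are also made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: text-level metadata as attributes, one <w> per token.
sample = """
<text genre="essay" level="B1" l1="zh">
  <s>
    <w pos="PRP">I</w>
    <w pos="VBZ" err="agreement">has</w>
    <w pos="DT">a</w>
    <w pos="NN">car</w>
  </s>
</text>
"""

root = ET.fromstring(sample)

# Token-level annotation: word form plus part-of-speech tag.
tagged = [(w.text, w.get("pos")) for w in root.iter("w")]
# Learner-corpus annotation: tokens carrying an error code.
errors = [w.text for w in root.iter("w") if w.get("err")]

print(root.get("genre"), root.get("level"), root.get("l1"))
print(tagged)
print(errors)
```

Keeping text-level information (genre, student level, mother tongue) on the enclosing element and token-level information (POS, error code) on each word lets a query tool filter on either.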

TEDDCLOG: an automatic gap-fill generator

Simon Smith, National Chengchi University, Taipei

Adam Kilgarriff, Lexical Computing Ltd, UK

Thanks to Adam Kilgarriff of Lexical Computing Ltd and Scott Sommers of Ming Chuan University for co-authoring the full paper; to Wu Guang-zhong of NCCU for help with preparation; and to Jason Chen for programming.

TEDDCLOG algorithm flow

[Flow diagram. Modules: Concordance module, Thesaurus module, Diffs module, Text processing module]

Key word: passive

Sentence: Children are particularly vulnerable to the effects of passive smoking.

Distractors: neutral, autonomous, mechanical

Check: neutral smoking x, autonomous smoking x, mechanical smoking x, passive smoking √

Test item: Children are particularly vulnerable to the effects of ___ smoking.

(a) mechanical (b) passive (c) neutral (d) autonomous
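The flow for the passive smoking item can be sketched as follows. This is an illustration only, not the TEDDCLOG code: the candidate list stands in for thesaurus output, the `attested` set stands in for corpus evidence about which word pairs occur, the candidate heavy is added here to show a rejected distractor, and the function name `make_item` is hypothetical.

```python
import random

def make_item(sentence, key, candidates, attested, n_distractors=3, seed=0):
    """Blank out `key` and pick distractors that do not collocate with the next word."""
    words = sentence.split()
    pos = words.index(key)
    # Assumes the key is not the last word of the sentence.
    following = words[pos + 1].strip(".,;:!?")
    distractors = [c for c in candidates if (c, following) not in attested][:n_distractors]
    options = distractors + [key]
    random.Random(seed).shuffle(options)  # fixed seed keeps the order reproducible
    words[pos] = "___"
    return " ".join(words), options

sentence = "Children are particularly vulnerable to the effects of passive smoking."
candidates = ["heavy", "neutral", "autonomous", "mechanical"]  # stand-in for thesaurus output
attested = {("passive", "smoking"), ("heavy", "smoking")}      # stand-in for corpus evidence

stem, options = make_item(sentence, "passive", candidates, attested)
print(stem)
print(options)
```

A candidate such as heavy is rejected because heavy smoking is an attested combination and would be a second correct answer; neutral, autonomous and mechanical pass the check.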

Useful test item from program

Less useful test item

• Now there's Alexander off with erm, with ___ pox.

(1) beef

(2) lamb

(3) vegetable

(4) chicken

Corpora for English learning

Simon Smith

National Chengchi University

DDL: Data-Driven Learning (數據驅動教學)

A typical on-line DDL task. Imagine you don’t know the word common.

Endangered languages

• SOAS archive at

– http://elar.soas.ac.uk/catalogue

– No access to data yet, only catalogue

• Formosan language archives

– Multilingual

– Many Formosan languages

– English and Chinese translations

– Spoken corpus

Bibliography

Chomsky, N. 1962. Transformational Approach to Syntax. Third Texas Conference on Problems of Linguistic Analysis in English, May 9-12, 1958, Studies in American English. Austin: The University of Texas. 124–158.

Sharoff, S. 2006. Creating general-purpose corpora using automated search engine queries. In Marco Baroni and Silvia Bernardini, editors, WaCky! Working papers on the Web as Corpus. Gedit, Bologna.

Smith, S. 2009. Corpora in the classroom: Data-Driven Learning for Freshman English. Proceedings of 2009 International Conference and Workshop on TEFL & Applied Linguistics, March 2009, Ming Chuan University.

Smith, S., Sommers, S. & Kilgarriff, A. 2008. Automatic cloze generation: getting sentences and distractors from corpora. Proceedings of 8th Teaching and Language Corpora Conference, Lisbon.

Smith, S. 2003. Predicting query types by prosodic analysis. Unpublished PhD dissertation, University of Birmingham, UK.

Kilgarriff, A., Rychlý, P., Smrž, P. & Tugwell, D. 2004. The Sketch Engine. Proceedings of EURALEX, Lorient, France, July 2004.

McEnery, A., Xiao, Z. & Tono, Y. 2005. Corpus-based Language Studies: An advanced resource book. London: Routledge.

Zeitoun, E., Yu, C.-H. & Weng, C.-X. 2003. The Formosan Language Archive: development of a multimedia tool to salvage the languages and oral traditions of the indigenous tribes of Taiwan. Oceanic Linguistics 42.1: 218–232.