Corpus Linguistics - NCCU
Outline
• What’s a (linguistic) corpus?
– What features?
• What can corpora be used for?
– What applications?
• What kinds of corpora are there?
– How representative?
– How big?
• How can we look inside corpora?
– Practice with some simple query tools
(Short) Quiz
• What does the Latin word corpus mean?
• What are corpora?
• What is linguistics?
• What’s the difference between
– Applied linguistics
– Sociolinguistics
– Psycholinguistics
– and Corpus linguistics?
Simple definition
• A database of language
– Features of a database?
• A more formal definition (McEnery, Xiao & Tono 2005)
– A collection of sampled texts, written or spoken, in machine-readable form, which may be annotated in various ways.
5 major uses for linguistic corpora
• Language learning and teaching
• Theoretical research on Language and Linguistics
– Including comparative studies
• Literary research and analysis
• Language technology
• Lexicography (=dictionary making)
– Cobuild, Longman, …
– All learner dictionaries now use corpora
How do you make a dictionary? (what resources…?)
• Use your own intuitions
• Ask all your friends for their intuitions
• Consult other dictionaries
• Read thousands of books
– and take lots of notes
• Use a corpus
Why do corpora keep getting bigger? (anyone?)
• Because they can
– Price of storage
– Speed of access
• Representativeness
– Small corpus: many examples of common words, maybe
– but…… ?
Lexical distribution
• What’s the most common word in English?
• What % does it make up of a whole corpus?
• The 100 most common words make up __% of all the words in a corpus?
• The 7500 most common words make up __%
• Answers:
– the most common word is "the"
– the 100 most common words make up 45%
– the 7500 most common words make up 90%
• So: you need massive corpora if you really want to represent rare words properly
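The coverage figures can be checked on any tokenised text. A minimal sketch in Python, assuming a naive regex tokeniser; the function names `tokenize` and `coverage` and the toy sentence are illustrative, not from the slides. The percentages quoted on the slide only emerge on a large corpus.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into word tokens (a naive tokeniser, assumed here)."""
    return re.findall(r"[a-z']+", text.lower())

def coverage(tokens, n):
    """Share of all tokens accounted for by the n most frequent word types."""
    counts = Counter(tokens)
    top = sum(freq for _, freq in counts.most_common(n))
    return top / len(tokens)

# Toy text; replace with the contents of a real corpus file.
text = "the cat sat on the mat and the dog sat on the log"
tokens = tokenize(text)
print(Counter(tokens).most_common(1))   # most frequent word and its count
print(round(coverage(tokens, 1), 2))    # share of the single commonest word
print(round(coverage(tokens, 3), 2))    # share of the three commonest words
```

Calling `coverage(tokens, 100)` and `coverage(tokens, 7500)` on a large corpus gives the two figures asked for in the quiz.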
Three ages of corpus research (in lexicography)
Kilgarriff, Lexical Computing. Taiwan, Dec 2006
• Pre-computer
• KWIC concordance (KWIC=?)
• Collocational tools (what’s a collocation?)
• Word Sketch (using Sketch Engine)
Age 1: Pre-computer
First Oxford English Dictionary (1860):
• 20 million index cards
– a word (usually rare) and a citation
Age 2: KWIC Concordance
1 arity, which will be used to take a party of under-privileged children to D
2 from outside. You are invited to a party and after a couple of drinks you d
3 tion, we believe politicians of all parties will listen to our views. &equo
4 ould be reaching agreement with all parties concerned, as to which events,
5 lack people. I have certainly been party to one or two discussions amongst
6 . These should be discussed by both parties before entering into the relatio
7 presents They had hosted a cocktail party at Kensington palace, for example
8 akes. By midnight the end-of-course party is in full swing, but most cadet
9 e should be a right for the injured party to terminate the contract. A mana
10 by the Safran Peoples ' Liberation Party. This presents the powerful neigh
11 s. Ahead I could see the rest of my party plodding towards the final slope t
12 cial ethic. The two main political parties - the Tories and the Liberals -
13 ritish successes in Perth The small party of British players competing in th
14 to help control. One member of the party went to summon the rescue team and
15 rket society fashion magazine. The party was held at his flat which was a l
16 security and secrecy than any Tory Party Conference : it seems that bootleg
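A concordance like the one above can be produced in a few lines. This is a minimal sketch in Python, not the tool used for the slide: the function name `kwic`, the context width and the toy sample text are all assumptions.

```python
import re

def kwic(text, pattern, width=30):
    """Return keyword-in-context lines: left context, keyword, right context."""
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in one column.
        lines.append(f"{left:>{width}} {m.group(0)} {right}")
    return lines

# Toy sample text (invented for illustration).
sample = ("You are invited to a party and after a couple of drinks you relax. "
          "The two main political parties disagreed. "
          "The party was held at his flat.")

for line in kwic(sample, r"\bpart(?:y|ies)\b"):
    print(line)
```

The context is cut at a fixed number of characters, which is why real concordance lines begin and end mid-word.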
Age 2: KWIC Concordances
• From 1980
• Computerised
• The COBUILD project was the innovator
• The coloured-pens method
The coloured pens method
Each concordance line for "party" is marked in a colour according to its sense:
1 political association
2 social event
3 group of people
4 person in an agreement/dispute
5 to be party to something...
Limitation of KWIC analysis
• As corpora get bigger: too much data
– 50 lines for a word: read all
– 500 lines: could read all, takes a long time
– 5000 lines: no
• Instead, create statistical summaries of word usage
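One kind of statistical summary is a ranked list of collocates. The sketch below, in Python, counts words near a node word and scores each pair with logDice, an association measure defined as 14 + log2(2·f(x,y) / (f(x) + f(y))). The choice of logDice, the window size, the function names and the toy sentences are assumptions made for illustration; the slides do not say which statistic is meant.

```python
import math
from collections import Counter

def collocates(sentences, node, window=3):
    """Count words within `window` tokens of `node`, plus overall word frequencies."""
    freq, pair = Counter(), Counter()
    for sent in sentences:
        tokens = sent.lower().split()
        freq.update(tokens)
        for i, tok in enumerate(tokens):
            if tok != node:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair[tokens[j]] += 1
    return freq, pair

def log_dice(f_xy, f_x, f_y):
    """logDice association score: 14 + log2(2 * f_xy / (f_x + f_y))."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

# Toy corpus (invented); a real summary needs thousands of lines.
sentences = [
    "the political party won the election",
    "a political party needs members",
    "the party was held at his flat",
    "the birthday party was fun",
    "the election was held in may",
]

freq, pair = collocates(sentences, "party")
summary = sorted(((log_dice(n, freq["party"], freq[w]), w, n)
                  for w, n in pair.items()), reverse=True)
for score, word, n in summary[:5]:
    print(f"{word:<10} co-occurrences={n} logDice={score:.2f}")
```

On a corpus this small, function words still rank highly; the summary only becomes informative at the sizes where reading every line is no longer possible.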
Corpus query tools
• WordSmith
– http://lexically.net
• Xaira
– http://xaira.org
• Sketch Engine
– www.sketchengine.co.uk
Functions of SkE
• KWIC concordance
– Sorting, filtering etc
• Word sketch
• Automatic thesaurus
• Sketch difference
– discriminate near-synonyms
Types of corpus
• Balanced corpora
– cf Sublanguage corpora
• Multilingual corpora
– Parallel corpora
– cf Formosan
• Spoken corpora
• Web corpora
• Learner corpora
– cf Reference corpora
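14 January 2009, Training Course for Chinese-Language Teachers at Korean Universities
University of Leeds Internet Corpus (UK)
• Sharoff (2005)
• 280 million words
• The interface displays co-occurrence statistics for collocates
• Query results are limited to immediately adjacent collocates
– ☑ 開汽車 ("drive a car")
– X 開過汽車 ("have driven a car")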
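The adjacency limitation can be shown with a small Python sketch. It assumes the text has already been segmented into words; the segmentation shown and the function name `cooccurs` are illustrative assumptions, not the Leeds interface.

```python
def cooccurs(tokens, w1, w2, max_gap=0):
    """True if w2 follows w1 with at most `max_gap` tokens in between."""
    for i, tok in enumerate(tokens):
        if tok == w1 and w2 in tokens[i + 1:i + 2 + max_gap]:
            return True
    return False

adjacent = ["開", "汽車"]          # 開汽車
separated = ["開", "過", "汽車"]   # 開過汽車, with the aspect marker 過 in between

print(cooccurs(adjacent, "開", "汽車"))               # True
print(cooccurs(separated, "開", "汽車"))              # False: adjacency only
print(cooccurs(separated, "開", "汽車", max_gap=1))   # True: one token allowed in between
```

An adjacency-only query corresponds to `max_gap=0`, which is why the second example is missed.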
Web corpora
• Several languages available on Sketch Engine
• Virtually no limit on size
• Must be “cleaned.” Meaning?
– Lists of names and numbers
– Form data
– HTML
– Pornography
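A rough idea of what cleaning involves, as a Python sketch using only the standard library. It strips HTML and drops lines that look like menus, form labels or number lists; the thresholds `min_words` and `min_alpha`, the class and function names and the toy page are assumptions. Content filtering (e.g. pornography) is not attempted here.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content, skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def clean(html, min_words=5, min_alpha=0.7):
    """Strip markup, then keep only lines that look like running text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    kept = []
    for line in "".join(parser.chunks).splitlines():
        line = line.strip()
        if len(line.split()) < min_words:
            continue  # drops menu items, form labels, short list entries
        alpha = sum(ch.isalpha() or ch.isspace() for ch in line) / len(line)
        if alpha >= min_alpha:
            kept.append(line)  # mostly letters, so unlikely to be a list of numbers
    return kept

# Toy page (invented for illustration).
page = """<html><body>
<ul><li>Home</li><li>Login</li></ul>
<p>Corpora built from the web need cleaning before they are useful.</p>
<p>Tel 101, 202, 303, 404, 505, 606</p>
<script>var x = 1;</script>
</body></html>"""

print(clean(page))
```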
Call types in BT corpus
Primary move type: Question (ask from top to bottom; stop when you can answer 'yes')
Prob: Is there only a description of a problem or situation?
Who: Is it a request about who to contact (e.g. which BT contact point or number to call)?
Info: Is it a request for information or advice (e.g. about BT services, number or account information, the state of the network, general knowledge, or time)?
Connect: Is it a request to be connected to another agent, service, person or organisation?
Action: Is it a request for operator action (e.g. named service; change to BT records or customer service options; initiation of a BT process such as line test; report a fault)?
Other: Everything else
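The table is an ordered decision list: the first question answered 'yes' gives the label. A minimal Python sketch, assuming a human annotator supplies the yes/no answers; the function name `primary_move` and the dictionary format are illustrative.

```python
# Ordered as in the table: the first question answered 'yes' decides the label.
MOVE_TYPES = ["Prob", "Who", "Info", "Connect", "Action"]

def primary_move(answers):
    """Return the first move type whose question was answered 'yes', else 'Other'.

    `answers` maps a move type to True/False, as judged by a human annotator.
    """
    for label in MOVE_TYPES:
        if answers.get(label, False):
            return label
    return "Other"

print(primary_move({"Info": True, "Action": True}))  # Info
print(primary_move({}))                               # Other
```

Because the list is ordered, a call that could count as both Info and Action is labelled Info.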
Annotation (=tagging)
• East Asian corpora must be segmented (斷詞)
• Annotation should be XML compliant
– But isn’t always
• No annotation:
– Can search for patterns
– Still useful
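Pattern search on unannotated text can be as simple as a regular expression. A small Python sketch; the toy text and the pattern are assumptions chosen to find the "be party to" construction.

```python
import re

# Toy raw text with no annotation at all (invented for illustration).
raw = ("I have certainly been party to one or two discussions. "
       "She was party to the agreement.")

# Pattern: a form of "be", then "party to", then the following word.
pattern = re.compile(
    r"\b(?:am|is|are|was|were|been|be)\s+party\s+to\s+(\w+)", re.IGNORECASE)

print(pattern.findall(raw))  # ['one', 'the']
```

Without POS tags the forms of "be" have to be listed by hand, which is one reason annotation helps.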
What to annotate?
• The most basic labels?
– POS (part of speech)
• Anything else?
– Genre, authorship, register…
• In a learner corpus?
– Error code, student level, mother tongue
• In my PhD spoken corpus?
– Move type
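What such annotation might look like in XML, and how it can be queried, is sketched below in Python. The element and attribute names (`text`, `s`, `w`, `pos`, `genre`, `register`, `move`) and the tag labels are hypothetical, not taken from any particular corpus scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation scheme: text-level metadata, sentence-level move type,
# word-level part-of-speech tags.
sample = """<text genre="conversation" register="informal">
  <s move="Info">
    <w pos="PRON">What</w>
    <w pos="VERB">is</w>
    <w pos="DET">the</w>
    <w pos="NOUN">time</w>
  </s>
</text>"""

root = ET.fromstring(sample)
print(root.get("genre"))                       # text-level metadata
print([s.get("move") for s in root.iter("s")])  # move type of each sentence
nouns = [w.text for w in root.iter("w") if w.get("pos") == "NOUN"]
print(nouns)                                   # all words tagged as nouns
```

Well-formed XML is what allows a standard parser to read the corpus; annotation that is not XML compliant needs a custom reader.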
TEDDCLOG: an automatic gap-fill generator
Simon Smith, National Chengchi University, Taipei
Adam Kilgarriff, Lexical Computing Ltd, UK
Thanks to Adam Kilgarriff of Lexical Computing Ltd and Scott Sommers of Ming Chuan University for co-authoring the full paper; to Wu Guang-zhong of NCCU for help with preparation; and to Jason Chen for programming.
TEDDCLOG algorithm flow
• Key word: passive
• Concordance module
– Children are particularly vulnerable to the effects of passive smoking.
• Thesaurus module
– neutral, autonomous, mechanical
• Diffs module
– neutral smoking x
– autonomous smoking x
– mechanical smoking x
– passive smoking √
• Text processing module
– Children are particularly vulnerable to the effects of ___ smoking.
– (a) mechanical (b) passive (c) neutral (d) autonomous
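The flow can be imitated with toy data. This Python sketch is not the TEDDCLOG implementation: the dictionaries `THESAURUS` and `ATTESTED` are hypothetical stand-ins for the thesaurus and sketch-difference lookups, and the extra candidate "second-hand" is invented to show a distractor being rejected.

```python
# Hypothetical stand-in for a thesaurus lookup: words similar to the key word.
THESAURUS = {"passive": ["neutral", "autonomous", "mechanical", "second-hand"]}
# Hypothetical stand-in for corpus evidence: word pairs attested as collocations.
ATTESTED = {("passive", "smoking"), ("second-hand", "smoking")}

def make_item(sentence, key, collocate, n_distractors=3):
    """Blank out `key` and pick distractors that do not combine with `collocate`."""
    if key not in sentence.split():
        raise ValueError("key word not found in sentence")
    # Reject any candidate that also collocates with the context word,
    # since it would be a second correct answer.
    distractors = [w for w in THESAURUS.get(key, [])
                   if (w, collocate) not in ATTESTED][:n_distractors]
    stem = sentence.replace(key, "___", 1)
    options = sorted(distractors + [key])
    return stem, options

stem, options = make_item(
    "Children are particularly vulnerable to the effects of passive smoking.",
    key="passive", collocate="smoking")
print(stem)
print(options)
```

The filtering step is what keeps the item fair: a distractor that is attested with "smoking" would make the question ambiguous.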
Less useful test item
• Now there's Alexander off with erm, with ___ pox.
(1) beef
(2) lamb
(3) vegetable
(4) chicken
Endangered languages
• SOAS archive at
– http://elar.soas.ac.uk/catalogue
– No access to data yet, only catalogue
• Formosan language archives
– Multilingual
– Many Formosan languages
– English and Chinese translations
– Spoken corpus
Bibliography
Chomsky, N. 1962. Transformational Approach to Syntax. Third Texas Conference on Problems of Linguistic Analysis in English, May 9-12, 1958, Studies in American English. Austin: The University of Texas. 124–158.
Sharoff, S. 2006. Creating general-purpose corpora using automated search engine queries. In Marco Baroni and Silvia Bernardini (eds.), WaCky! Working papers on the Web as Corpus. Bologna: Gedit.
Smith, S. 2009. Corpora in the classroom: Data-Driven Learning for Freshman English. Proceedings of 2009 International Conference and Workshop on TEFL & Applied Linguistics, March 2009, Ming Chuan University.
Smith, S., Sommers, S. & Kilgarriff, A. 2008. Automatic cloze generation: getting sentences and distractors from corpora. Proceedings of 8th Teaching and Language Corpora Conference, Lisbon.
Smith, S. 2003. Predicting query types by prosodic analysis. Unpublished PhD dissertation, University of Birmingham, UK.
Kilgarriff, A., Rychlý, P., Smrž, P. & Tugwell, D. 2004. The Sketch Engine. Proceedings of EURALEX, Lorient, France, July 2004.
McEnery, A., Xiao, Z. & Tono, Y. 2005. Corpus-based Language Studies: An advanced resource book. London: Routledge.
Zeitoun, Elizabeth, Yu Ching-hua & Weng Cui-xia. 2003. The Formosan Language Archive: development of a multimedia tool to salvage the languages and oral traditions of the indigenous tribes of Taiwan. Oceanic Linguistics 42.1: 218–232.