Learner corpus research - hands on
Tom Cobb
Didactique des langues / éducation
Université du Québec à Montréal
Saturday, October 31
8:15am - 10:15am
lextutor.ca/cv/slrf_09/corpus.ppt
2
Dr. Cobb will provide a "crash course" in carrying out research using learner corpora and small teacher or researcher built corpora generally. He will lead a walk-through of a study he has conducted using corpus data and address the work that had to be done and issues to be resolved at each stage of the study, offering a behind-the-scenes look at how corpus research is carried out. In addition he will display some new and accessible online tools for corpus work, hoping to encourage instructors or researchers from other areas to get some hands-on experience in the learner corpus paradigm.
3
Dr. Cobb will provide a [1] "crash course" in carrying out [1a] research using learner corpora and [1b] small teacher or researcher built corpora generally. He will lead a [2] walk-through of a study he has conducted using corpus data and [2a] address the work that had to be done and [2b] issues to be resolved at each stage of the study, offering a behind-the-scenes look at how corpus research is carried out. In addition he will display some [3] new and accessible online tools for corpus work, hoping to [4] encourage instructors or researchers from other areas to get some hands-on experience in the learner corpus paradigm.
4
LEARNER CORPUS
crash course
research using learner corpora or other small corpora
walk-through of a study
address the work that had to be done
issues to be resolved at each stage
display online tools for corpus work
encourage hands-on experience
+ a bit of context
5
At 10.15 you will know…
What a corpus is
Why corpus research is important
What it has contributed to applied linguistics
The uses it can have for researchers
… for instructors
How to build a corpus
Choice points in building a corpus
… interpreting a corpus
Some tools of corpus analysis
How to do a learner corpus study
Results from some published studies
The future of learner corpus studies
6
Corpora – what are they?
7
What is a corpus?
A large collection of language in use, but
Not only large
Not necessarily so large
Assembled systematically, according to explicit criteria of representativeness
How large? Depends on the goal
8
Goals and sizes
Linguistics goal – to represent entire language
• 100 million wds still under-represents common collocations
Pedagogical goal – Ss meet common words, structures
• 1 million words gives 10 hits for frequent words
Applied linguistics goal – trace an acquisition feature
• 100,000-200,000 words is common
9
Sub-goals and sizes
Pedagogical goal – Ss meet common grammar and vocab
Grammar – 1 million is adequate
– All structures get many hits
Lexis
• Basic vocab – 1 million gives 10 hits @ 2k level
• Main collocations – 1 million gives the main ones
Torrential rain?
• “Raining cats and dogs”? – 1 billion gives 5 hits
• Identify specialist lexis – 200,000 may be enough
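The size figures on these two slides are proportional arithmetic: if an item turns up a given number of times per so many words, the expected hits scale with corpus size. Here is a minimal Python sketch of that reasoning. It assumes the rate is stable from corpus to corpus (a simplification), and the function names are illustrative only.

```python
def expected_hits(hits: int, per_words: int, corpus_size: int) -> float:
    """Expected occurrences in a corpus of corpus_size words,
    given a known rate of `hits` occurrences per `per_words` words."""
    return hits * corpus_size / per_words


def words_needed(hits: int, per_words: int, target_hits: int) -> int:
    """Smallest corpus size (in words) expected to yield target_hits."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-target_hits * per_words // hits)


# Rates read off the slides: 10 hits per million for a 2k-level word,
# 5 hits per billion for "raining cats and dogs".
frequent_in_bnc = expected_hits(10, 1_000_000, 100_000_000)
idiom_in_one_million = expected_hits(5, 1_000_000_000, 1_000_000)
size_for_ten_idiom_hits = words_needed(5, 1_000_000_000, 10)
```

On these assumptions a one-million-word corpus would be expected to show the idiom far less than once, which is why the size needed depends on the goal.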
10
11
A growth industry
Brown 1970: 1,000,000 wds, http://icame.uib.no/brown/bcm.html
BNC 1994: 100,000,000 wds, www.natcorp.ox.ac.uk
Cambridge Int’l 2002: 1,000,000,000 wds, www.cambridge.org./elt/corpus/international_corpus.htm
Plus ANC, Bank of English, Cancode …
12
Design / composition e.g., Brown (1970s)
Page from Lextutor
13
What does a corpus represent?
A language as a whole
• BNC
Or a part
• Cancode oral, MICASE academic
Or an individual
• Jack London’s collected works
Or a group of individuals
– Class of ESL learners
14
How do we read a corpus?
Cannot read it naturally
– Defeats the goal
Needs the help of a search technology
concordance, index, frequency list, many others
15
Concordancers
http://www.lextutor.ca/concordancers/concord_e.html
16
Lists
http://www.lextutor.ca/freq/compleat_lister/
17
Indexes
http://www.lextutor.ca/concordancers/text_concord/
18
Corpora – why do we need them?
19
Why do we need corpora?
A. Corpus work is sexy
B. We have computers – let’s use them
C. Linguistic intuitions are unreliable
20
Linguistic intuitions are notoriously unreliable
Demo 1: Do you think however is more common in spoken or in written language?
By how much? (3 to 1… etc)
21
http://www.lextutor.ca/range/range_corpus/
22
Demo 2: What are the main senses of back and which is most common?
• By what factor?
http://www.lextutor.ca/concordancers/concord_e.html
23
24
25
Demo 3: Can you rank order these roughly by frequency band?
0-2k, 3k-5k, 6k-10k, 11k-15k
http://www.lextutor.ca/freq/train/
26
Try one? http://www.lextutor.ca/freq/train/
27
But not always
Demo 4: Which do you think is more common, man and woman,
or woman and man?
Factor of 10:1, 5:1, 2:1?
Go Live http://www.lextutor.ca/concordancers/concord_e.html
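For anyone who wants to run Demo 4 on their own text file rather than the online concordancer, the comparison comes down to counting two phrases. Here is a small Python sketch. The function names are illustrative, and the matching is a plain case-insensitive whole-word search, which may differ from the concordancer's own matching.

```python
import re


def count_phrase(text: str, phrase: str) -> int:
    """Count whole-word, case-insensitive occurrences of a phrase."""
    words = [re.escape(word) for word in phrase.split()]
    pattern = r"\b" + r"\s+".join(words) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def compare_orders(text: str, first: str, second: str) -> tuple[int, int]:
    """Counts for 'first and second' versus 'second and first'."""
    forward = count_phrase(text, f"{first} and {second}")
    reverse = count_phrase(text, f"{second} and {first}")
    return forward, reverse


toy_text = "Man and woman. A man and woman met. Woman and man."
forward, reverse = compare_orders(toy_text, "man", "woman")
```

Dividing the first count by the second gives the factor asked for in the demo, provided the second count is not zero.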
28
Many linguistic intuitions are unreliable
Implicit patterns are extremely slow to extract from input
N. Ellis, J. Hulstijn
… because of the severe limitations on what we can see and remember
… unaided
29
Scientific instrumentation - a brief history
30
Not only linguistic intuitions are problematic
For every appearance, many possible explanations
Stand outside on a starry evening, what does it look like?
31
The role of the computer in modern science is well known. In disciplines like physics and biology, the computer's ability to store and process inhumanly large amounts of information has disclosed patterns and regularities in nature beyond the limits of normal human experience. Similarly in language study, computer analysis of large texts reveals facts about language that are not limited to what people can experience, remember, or intuit. In the natural sciences, however, the computer merely continues the extension of the human sensorium that began 200 years ago with the telescope and microscope. But language study did not have its telescope or microscope. The computer is its first analytical tool, making feasible for the first time a truly empirical science of language.
– Cobb 1999
32
Before the computer, linguists could only study small samples of language at a time because of their limitations of their powers of observation and their memories. Even scholars who relentlessly collected instances of usage all their lives only had a few examples of any particular pattern, and there was no way of telling what they had missed.
Sinclair, 2003, p. ix
33
Dr Johnson, A Dictionary of the English Language
Longman 1755
Based on quotations from literature copied onto many slips of paper
But using literature has some problems
- Old and recent lit conflated
- Is literature truly representative of life’s typical situations?
- Is its lexis a bit recherché?
Early corpora
34
120 years later - James Murray, OED 1879 – REAL LANGUAGE examples sent in by post - Oxford City Post Office sets up a special sub-branch for OED
35
Most sciences - supplemented by technologies from 15th century
BIOLOGY..……….microscope ASTRONOMY..…..telescope NAVIGATION.……astrolabe etc
Language study – late 20th century –
….machine readable corpora
36
Thus the “corpus revolution”
Dictionaries Grammars Courses Studies
37
Of particular note…
LGSWE
38
Corpus – successes
39
Fabled Core of English is close to disclosure
Main lexis + coverage: 2000 wd families = 80% (Carroll et al 76)
Main collocations in BNC-speech: 84 HF collocations belong in 1k list (Shin & Nation 2007)
Main phrasal verbs: 25 ph vbs = 1/3 of all ph vbs in BNC (Gardner & Davies 2007)
Main morphologies (Bauer & Nation 1993)
Main stress patterns (Murphy & Kandil)
Cf. All this coming together at the same time as the human genome, also a corpus project
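A coverage figure such as "2000 word families = 80%" is the share of running words (tokens) in a text that appear on a given list. Here is a minimal Python sketch of the calculation. It assumes a plain set of word forms stands in for a real family list, which would also group inflected and derived forms. The names and the toy list are illustrative only.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split text into lower-case word tokens (ASCII letters only)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def coverage(tokens: list[str], word_list: set[str]) -> float:
    """Percentage of tokens that appear on word_list."""
    if not tokens:
        return 0.0
    covered = sum(1 for token in tokens if token in word_list)
    return 100.0 * covered / len(tokens)


# Toy stand-in for a high-frequency list.
toy_list = {"the", "on", "cat", "sat"}
percent = coverage(tokenize("The cat sat on the mat"), toy_list)
```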
40
Ancient prescriptivism is close to defeated in language pedagogy
Except one debate remains: corpus-based v. corpus-informed approaches
Corpus-based
If it’s in the corpus X times, it’s OK
X to be defined
Corpus-informed
Corpus information is one source of information
41
Numerous errors are now corrected (in principle)
Definitions no longer harder than the defined word
Simple present no longer automatically the first verb tense taught
Written language no longer the model for spoken language
Status of multi-word units reinstated
Grammar no longer taught …
via unknown lexis
as unconnected to lexis
42
Task
Grammar as connected to lexis? Let’s see what this could mean
+ practice “reading concordances”
Get out “borders on”
• (From Sinclair http://www.twc.it/)
What is the pattern? What does it mean?
Can we call this “word grammar”?
43
041. cember, Karimov became is more than just a way of life – it BORDERS on a religion. But there is of the laws of the sea s
042. n a religion. But there is of the laws of the sea sometimes BORDERS on arrogance. Not only should the international coll
043. ot only should the international collaboration is great and BORDERS on cartel like behaviour. who say using the extremis
044. on cartel like behaviour. who say using the extremist label BORDERS on demagoguery and will only serve Yugoslavia. What
045. ery and will only serve Yugoslavia. What is occurring there BORDERS on genocide. No country or society Careless but losi
046. o country or society Careless but losing two in the one day BORDERS on incompetence. Now Charlie Turkey, the only NATO c
047. competence. Now Charlie Turkey, the only NATO country which BORDERS on Iraq, is playing a key role in Her mastery of the
048. aq, is playing a key role in Her mastery of the short story BORDERS on perfection. kate saunders country’s stagnant grow
049. fection. kate saunders country’s stagnant growth, which now BORDERS on recession. Here again, the challenge looms ugly w
050. ession. Here again, the challenge looms ugly when recession BORDERS on slump. Everybody is on edge, The author, a lifelo
051. incredible. In the case_0 of maxim ‘The collector’s passion BORDERS on the chaos of memories.’ before staged protests at
052. he paranoid and, although and an easy going demeanour which BORDERS on the charismatic, it’s hardly popular music. In so
053. ian province of Kosovo, a professional solicitousness which BORDERS on the dangerous edge of savings accounts versus sha
054. e Soviet Central Asian clash. He said: ‘The hostility there BORDERS on the dangerous.’ Black players and – and to perfor
055. pathological. The sky, a then Claire makes a statement that BORDERS on the downright cocky. When I ask The linear intens
056. the chaos of memories.’ before staged protests at these two BORDERS on the east and west of their speaking to troops in
057. e obsessive. But there is the Sierra Madre” as he dubs them BORDERS on the eccentric. Mountain lions courses and opportu
058. ccentric. Mountain lions courses and opportunities, that it BORDERS on the embarrassing. This the straight, but his winn
059. on the obsessive. He portrays has a streak of bravery which BORDERS on the foolish. She has delicate to buy. A family wi
060. sensational because the amount of work he is required to do BORDERS on the incredible. In the case_0 of maxim ‘The colle
061. rs on the dangerous edge of savings accounts versus shares, BORDERS on the irresponsible. an independent Bosnia in its p
062. the contrary, his private His love for all things maritime BORDERS on the obsessional. He is truly Not surprisingly, th
063. ally acceptable, four even_0 harbour a passion for DIY that BORDERS on the obsessive. But there is the Sierra Madre” as
064. on slump. Everybody is on edge, The author, a lifelong fan, BORDERS on the obsessive. He portrays has a streak of braver
065. right cocky. When I ask The linear intensity of their songs BORDERS on the paranoid and, although and an easy going deme
066. on the surreal. Wander into the The atmosphere of paranoia BORDERS on the pathological. The sky, a then Claire makes a
067. the embarrassing. This the straight, but his winning effort BORDERS on the sensational because the amount of work he is
068. surreal. He had his own most dangerous regions on Earth. It BORDERS on the Serbian province of Kosovo, a professional so
069. lish. She has delicate to buy. A family with three children BORDERS on the socially acceptable, four even_0 harbour a pa
070. east and west of their speaking to troops in Xinjian which BORDERS on the Soviet Central Asian clash. He said: ‘The hos
071. gerous.’ Black players and – and to performing them sort of BORDERS on the surreal. He had his own most dangerous region
072. e obsessional. He is truly Not surprisingly, the atmosphere BORDERS on the surreal. Wander into the The atmosphere of pa
073. arismatic, it’s hardly popular music. In some cases_1, this BORDERS on wholesale plagiarism. That’s * __________________
074. on the irresponsible. an independent Bosnia in its pre war BORDERS. On the contrary, his private His love for all thing
075. ________________________ and on mutual respect for existing BORDERS” on December, Karimov became is more than just a way
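The concordance lines for "borders on" are keyword-in-context (KWIC) lines sorted on the words to the right of the keyword, which is what makes the "on the + adjective" pattern stand out. Here is a minimal Python sketch of that kind of display, for running on your own text. It is an illustration only, not the Lextutor concordancer, and the function names and the 30-character window are arbitrary choices.

```python
import re


def kwic(text: str, keyword: str, width: int = 30) -> list[tuple[str, str, str]]:
    """Return (left context, KEYWORD, right context) for every hit,
    sorted alphabetically on the right context."""
    text = " ".join(text.split())  # collapse line breaks and extra spaces
    pattern = r"\b" + re.escape(keyword) + r"\b"
    lines = []
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append((left, match.group(0).upper(), right))
    lines.sort(key=lambda line: line[2].lower())
    return lines


def show(lines: list[tuple[str, str, str]], width: int = 30) -> list[str]:
    """Format KWIC lines with the keyword aligned in a column."""
    return [f"{left:>{width}} {kw} {right}" for left, kw, right in lines]


toy_text = ("His passion borders on the obsessive. "
            "Turkey borders on Iraq. It borders on arrogance.")
for line in show(kwic(toy_text, "borders")):
    print(line)
```

Sorting on the left context instead (reading each left string backwards) would bring out what kinds of subject come before the keyword.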
44
Corpus – failures
45
And yet…
“The corpus-driven revolution in applied linguistics continues apace, and along with it the paradox that as corpora change the face of applied linguistics (most dictionaries, grammars, and course books now claim to be corpus based) it is largely without the participation of practitioners. Only a few teachers or researchers have ever built a corpus or delved through concordance lines.”
- Cobb 2008, review of CBLS
46
Stalled enterprise (McCarthy, 2008)
Teachers and researchers need to become producers, not just consumers, of corpus research
Why?
To evaluate “corpus based” claims
Often vocab but not grammar is CB, etc
What kind of corpus?
To effectively lobby to get their CB needs met
e.g. Gram+lex of specific domains
To develop their own CB materials
Who still uses a course book?
To build their own corpora for action research projects
47
Stumbling blocks
Some intimidation remains attached to corpus work
It is not universally appreciated in SLA - Widdowson
Computer stuff looks daunting
- Seems more linguistics than applied
POLICY OF THIS WORKSHOP:
There are some fairly clear reasons to do this and simple ways to get started
48
… The classic corpora are not easy-access
- Despite long lists on the Web
- Even McCarthy’s Cancode is 100% unavailable to researchers
- Ref Tribble review of O’Keeffe et al
- Especially in languages other than English
- Lextutor users’ requests for German =>
Solutions <= [1] Band together (CECL) - [2] Make your own =>
49
DIY corpus – why?
50
German http://www.lextutor.ca/concordancers/braun_info.html
51
Why bother – Google is a corpus
Ref – Robb
52
53
Classic case, breadth v. depth
Web-as-corpus gives massive volume
Even smallish DIY corpus gives
Better quality search
Families, starts with, ends with
Easier access to detail & context
Better exposure to pattern
+ you can make your own, target your own needs
Material for learners
Material from learners
v. corpus
54
DIY corpus – how?
55
Build your own - HOW
Many texts on the Web
E.g., http://www.lextutor.ca/bookbox/
Question of selection replaces question of access
Must be or become text files (whatever.txt) «dot txt»
Whether you want a one-big-file corpus
Or several-small-files corpus
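If the texts start out as several small .txt files, turning them into a one-big-file corpus takes only a short script. Here is a minimal Python sketch, assuming all the files sit in one folder and are saved as UTF-8. The folder and file names in the usage comment are hypothetical.

```python
from pathlib import Path


def build_corpus(folder, out_file, encoding="utf-8"):
    """Join every .txt file in `folder` into one corpus file.

    Files are added in alphabetical order, separated by a blank line.
    Returns the number of files included.
    """
    folder = Path(folder)
    out_path = Path(out_file).resolve()
    parts = []
    for path in sorted(folder.glob("*.txt")):
        if path.resolve() == out_path:
            continue  # never feed an earlier output file back into itself
        parts.append(path.read_text(encoding=encoding).strip())
    out_path.write_text("\n\n".join(parts) + "\n", encoding=encoding)
    return len(parts)


# Usage, with hypothetical names:
# build_corpus("learner_essays", "learner_corpus.txt")
```

Keeping the small files as well as the big one leaves both options open, since a one-file corpus suits concordancing while separate files are needed for looking at individuals.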
56
Only plain .TXT files make corpora
57
One big file: a) Insert
58
One big file: b) Upload http://www.lextutor.ca/tools/corpus_builder2/
59
DIY corpus for learning materials
60
Using CB tools to select / develop learning materials?
Using news texts?
Check first against CB frequency lists
Pre-teaching vocab?
Find the CB keywords
Writing tests?
Check it contains gram+lex the Ss have actually seen
Teaching a speaking course?
Check models are speech not writing
61
Build corpus as learning materials
For some purpose
Must make some sampling sense
EG one London – all London
All course materials
Corpus of graded readers
62
Learning materials – multi-file corpus
http://www.lextutor.ca/callwild
63
Learning materials – one-file corpus
http://conc.lextutor.ca/list_learn/eng/
64
Learning materials – one-file corpus
http://www.lextutor.ca/corpus_grammar/
65
DIY for research purposes
66
1. Written production
67
Learner text more and more available
- Collect & investigate because it is there?
Some typical purposes
- determine needs
- check progress
- Cf. active vs. passive ability
- explore for experimental hypothesis
Constraints
Choose topic carefully
Does topic suggest just one verb tense?
Cf capital punishment vs. my holiday
Very different language demands
68
Models of LCs
Learners vs. NSs
Ls vs. Ls –
Snapshot or Longitudinal (same Ls at diff times)
Or diff Ls at diff stages in learning ≅ longitudinal (Cross-sectional)
OR
Belz (04, citing Cobb 03): 4 LC variables should be controlled:
1. type of learner (e.g., FL vs. SL)
2. stage of learner
3. text type/purpose/register/conditions
4. the availability of a similar corpus of native speaker data
69
NS data must be comparable
Best example is UCL’s Locness
Louvain Corpus of Native English Essays
149,574 words of argumentative essays written by American university students
18,826 words of literary-mixed essays written by American university students
59,568 words of argumentative and literary essays written by British university students
60,209 words of British A-level argumentative essays.
70
Issues in LC
SMALL ISSUES –
Tag or not?
Spell check or not, or at what point?
One file or many?
BIG ISSUE - Granger 2004, p. 124
What kind of data is a LC?
“LC typically fall into the category of natural or open-ended data” while “SLA researchers tend to prefer [1] introspective or [2] experimental/elicited data…”
V BIG ISSUE -Is this paradigm an instance of Bley-Vroman’s (1983) “comparative fallacy”?
71
Once made, flat or tagged?
Pros of flat corpus
If for learning materials, = what learners face
• THEY must make sense of data
• Tagged does it for them
Easier to make, you can have more
Search inputs require some work, trial + error
Pros of tagged corpus
Precise comparisons are possible
Especially for N-N compounds and errors
But learner data poses special problems
Tags are needed for error analysis
• VP + ADV + D OBJ, etc
Yet learner data confuses taggers
72
Error tagger (UCL Err Extractor – Granger 02)
specific-purpose, known-target tagging
- Unlikely to confuse tagger, but a ton of work
73
Here’s a set of studies I’m working on
LC study typically begins with a practical problem
Theoretical conundrums? not so much
E.g., this problem:
Montreal learners
Eight years ESL
At 18 many switch to English-language system
With insufficient vocabulary for advanced study in English
Fully competent only at 1k
74
Big question
Input: What lexis are these kids getting in school?
RQ
Do their NNS teachers have enough vocab themselves to get kids over the 1k-hump?
75
Procedure
Run Vocab size test on Ts
Nation’s new 14k – lextutor.ca/tests/
Get small exploration corpus of their production
“How could the TESL program be improved?”
Argumentative + opinion
Get similar sized NS corpus
LOCNESS, A-Levels, UK
“An invention that has changed how we live”
Compare for structure and lexis
Quantity (frequency) and quality
Focus on lexis 2k+
76
77
Prelude
Look at TESLProg.txt in your handout as demo mini-corpus
Writing task was this 5-minute in-class writing exercise
Peter Elbow, keep writing idea
Discursive topic
How could UQAM new TESL program be improved?
Homework:
- identify your main point
- focus + elaborate for Web publication
Each paper gets three rounds of feedback
78
Computers have become a huge part of our lives in both the areas of work and education. But are they such a good thing? When calculators came along a drop in ability of students for mental arithmetic was obvious and now they are used for the simplest calculations. The computer could do the same thing. Computers encourage laziness in the general public, why work out something yourself when the computer can do it for you. This is very time saving and efficient but it is causing people to forget basic ideas. For instance, spelling is no longer as important as it was you can simply use a "spellcheck" to correct your English, which is absurd. For the youth of today computers offer links around the world and millions of facts and figures. This could be argued to be educational. However, this is killing the imagination of children and they spend hours sat at a keyboard tapping away in the doom and gloom of the house. They should be out enjoying themselves and gaining experiences for themselves instead of reading about them on a flat screen. It is said that you can meet people through computers and have `relationships'. I find this preposterous and people are losing the ability to communicate and form relationships.
Comparison text from Locness (ex 1)
79
Computers may be the future but what part will man have in this future. There will be no need for people to go to school as they could be taught at home, people would hardly ever talk and the only career available would be for computer programmers. I agree that computers are helpful but people should not live through their computers and be so reliant on them. They should read books and live more in order to regain their lost imagination and sense of adventure. Also, in schools I feel that work should be done mainly by hand and calculators and computers should only be used minimally in mathematics in order to stop the production of computer addicts and again have normal people.
Comparison corpus from Locness (2)
More lexis? Less? A little? A lot?
http://www.lextutor.ca/vp/bnc/
80
Which analysis software?
81
Basic structure snapshot (Qc corpus)
http://www.lextutor.ca/concordancers/text_concord
82
http://www.lextutor.ca/concordancers/text_concord
83
84
http://www.lextutor.ca/tuples/eng/
85
Lexis comparison
86
Lexis comparison
NNS corpus (Quebec TESL trainees)
155 post-1k word families / 3356 tokens
NS corpus (UK A-Levels essay)
269 post-1k word families / 3630 tokens
But that’s not all
Split up corpus
Look at individuals
http://www.lextutor.ca/vp/bnc/
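The comparison on this slide rests on two counts per corpus: total tokens, and the number of different items that fall outside the first-thousand (1k) list. Here is a minimal Python sketch of that profile. One difference matters: it counts word types (distinct forms), not word families as Vocabprofile does, so its figures would come out higher than family counts. The names and the toy 1k list are illustrative only.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split text into lower-case word tokens (ASCII letters only)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def post_list_profile(tokens: list[str], base_list: set[str]) -> dict[str, int]:
    """Count tokens overall, and tokens and types beyond base_list."""
    beyond = [token for token in tokens if token not in base_list]
    return {
        "tokens": len(tokens),
        "post_list_tokens": len(beyond),
        "post_list_types": len(set(beyond)),
    }


# Toy stand-in for a real 1k list.
toy_1k = {"the", "world"}
profile = post_list_profile(
    tokenize("The invention changed the world, the invention"), toy_1k
)
```

Running the same function on each writer's text separately is one way to split up the corpus and look at individuals.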
87
88
Almost all post-2ks are used by one writer only
89
Conclusion
Interesting peripheral differences for another study
Syntax correct but unelaborated
Phrases heavy on the short end, light on the long end
Low proportion of noun-noun
Vocab - Heavy reliance on 1k vocab
Low Post-1k
Items used by one person
Yet good recognition scores at 3k+ levels
Known words are not getting used
Unlikely to get used in classroom
90
2. Oral production corpus
91
Let’s learn more about the previous study:
Follow trainees into their classrooms
Does the predicted pattern occur?
If new words appear, are they recycled?
*See Horst’s Teacher Talk Corpus study in a forthcoming RIFL (2011)
(Note: Different subjects – here we are establishing tools & method)
92
Looks like rich lexical input…
18 hrs of NS-T classroom talk
93
Summary
Post-1k words (learning zone)
1570 families
900 appear in one class-hour only
Inc 300 one TIME only
Recycling is not happening
Now add this to the NNS data
Few post-1k used in own writing
The problem starts to make sense
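The recycling figures in this summary come from a range count: for each post-1k word, in how many class-hours does it turn up, and how many times in all? Here is a minimal Python sketch of that count. It assumes each class-hour has already been tokenized into a list of words and that a 1k list is available as a set. Like the other sketches it counts word forms, not families, and the names are illustrative only.

```python
from collections import Counter


def recycling_report(sessions: list[list[str]], base_list: set[str]) -> dict:
    """Summarise how far words beyond base_list recur across sessions."""
    total = Counter()     # occurrences of each word over all sessions
    in_range = Counter()  # number of sessions each word appears in
    for tokens in sessions:
        beyond = [token for token in tokens if token not in base_list]
        total.update(beyond)
        in_range.update(set(beyond))
    one_session = sorted(w for w, n in in_range.items() if n == 1)
    one_time = sorted(w for w in one_session if total[w] == 1)
    return {
        "post_list_words": len(total),
        "one_session_only": one_session,
        "one_time_only": one_time,
    }


# Three toy "class-hours"; the 1k list here is a stand-in.
toy_sessions = [
    ["the", "glacier", "melts", "the", "glacier"],
    ["the", "volcano", "melts"],
    ["a", "melts"],
]
report = recycling_report(toy_sessions, {"the", "a"})
```

Words that land in the one-session or one-time lists are the ones learners are unlikely to meet often enough to acquire.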
94
Or, Alert’s 108,000 wds, no past tense!
Went, saw
http://www.lextutor.ca/concordancers/concord_e.html
95
3. Goal clarification
96
Let’s work through a published study
Ovtcharov & Cobb 2006 (in French)
Situation: Ottawa
Civil service promotions depend on success in L2 oral interview
Pass/fail evaluated globally (=impressionistically)
“A well developed vocabulary” is one of the stated criteria
But what is it?
The usual soft focus
97
Needed for the study
1. Corpus of transcribed oral interviews
Both passes, fails, & borderlines
24 of each, 25-35 minutes
100s of hours work
2. French version of Vocabprofile
Lemmatized large-corpus based, k-leveled frequency lists?
Miraculously appear in c. 2001
See Cobb & Horst, 2004
3. Usable NS reference corpus
Provided by Beeching, 2001
French oral interviews in USA
98
Identifiable difference at 2k
Strong difference at 3k+MHL (off-list)
Result
99
(Assuming replication)
One less failure-to-communicate in the vastness of high-stakes language instruction
The instructional design process has a place to begin
Significance
100
Corpus research is a fairly simple, bean-counting type of research
That can solve complex problems in language learning & teaching, both
Practical
What do these people need to learn?
Can examiners’ impressions be operationalized?
Theoretical
E.g., Piecing together the portrait of advanced interlanguage (Cobb 2003)
So…
101
Course tie-up
102
At 10.15 you now know…
What a corpus is
Why it is important
What insights it has yielded in applied linguistics
The uses it can have for researchers
… for instructors
How to build a corpus
Choice points in building a corpus
Some tools of corpus analysis
How to do a learner corpus study
The results of some published learner corpus studies
The future of learner corpus studies
103
The Future
104
Corpus research carries on shining the light into dark corners
- 2007-2009 work from Dee Gardner, Stuart Webb
Some increase in corpus awareness
- Teacher training programs
- MA methods courses
Collaboration reduces labour
- CECL, the Locness reference corpus
- Promise of automatic corpus comparisons at Calper Gold
Dev. world can play as tools go online
Where do we go from here?
105
If we have time…
The final challenge to the utility of frequency lists
As already seen
We are closing in on the Core of English
This includes a smaller than expected group of true homonyms
No corpus tool-kit so far deals with these systematically
E.g. a Vocabprofile analysis does not distinguish bank and bank
106
Go live
http://www.lextutor.ca/concordancers/text_concord
www.lextutor.ca
This PPT at http://www.lextutor.ca/cv/slrf_09/corpus.ppt
References list at http://www.lextutor.ca/cv/slrf_09/handout.doc