Learner corpus research - hands on Tom Cobb Didactique des langues / éducation Université du Québec à Montréal


Learner corpus research -

hands on Tom Cobb

Didactique des langues / éducation

Université du Québec à Montréal

Saturday, October 31

8:15am - 10:15am

lextutor.ca/cv/slrf_09/corpus.ppt


Dr. Cobb will provide a "crash course" in carrying out research using learner corpora and small teacher or researcher built corpora generally. He will lead a walk-through of a study he has conducted using corpus data and address the work that had to be done and issues to be resolved at each stage of the study, offering a behind-the-scenes look at how corpus research is carried out. In addition he will display some new and accessible online tools for corpus work, hoping to encourage instructors or researchers from other areas to get some hands-on experience in the learner corpus paradigm.


Dr. Cobb will provide a [1] "crash course" in carrying out [1a] research using learner corpora and [1b] small teacher or researcher built corpora generally. He will lead a [2] walk-through of a study he has conducted using corpus data and [2a] address the work that had to be done and [2b] issues to be resolved at each stage of the study, offering a behind-the-scenes look at how corpus research is carried out. In addition he will display some [3] new and accessible online tools for corpus work, hoping to [4] encourage instructors or researchers from other areas to get some hands-on experience in the learner corpus paradigm.


LEARNER CORPUS crash course
- research using learner corpora or other small corpora

walk-through of a study
- address the work that had to be done
- issues to be resolved at each stage

display online tools for corpus work

encourage hands-on experience

+ a bit of context


At 10:15 you will know…
- What a corpus is
- Why corpus research is important
- What it has contributed to applied linguistics
- The uses it can have for researchers
- … for instructors
- How to build a corpus
- Choice points in building a corpus
- … interpreting a corpus
- Some tools of corpus analysis
- How to do a learner corpus study
- Results from some published studies
- The future of learner corpus studies


Corpora – what are they?


What is a corpus?
- A large collection of language in use, but
  - Not only large
  - Not necessarily so large
- Assembled systematically, according to explicit criteria of representativeness
- How large? Depends on the goal


Goals and sizes

Linguistics goal - to represent entire language
• 100 million wds still under-represents common collocations

Pedagogical goal – students meet common words, structures
• 1 million words gives 10 hits for frequent words

Applied linguistics goal – trace an acquisition feature
• 100,000-200,000 words is common


Sub-goals and sizes

Pedagogical goal – students meet common grammar and vocab

Grammar – 1 million is adequate
– All structures get many hits

Lexis
• Basic vocab – 1 million gives 10 hits @ 2k level
• Main collocations – 1 million gives the main ones
  Torrential rain?
• “Raining cats and dogs”? – 1 billion gives 5 hits
• Identify specialist lexis – 200,000 may be enough
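
The size figures above follow from proportional arithmetic: expected hits = rate per million words × corpus size in millions. A minimal sketch in Python, assuming hits scale linearly with corpus size (an idealisation, since real rates vary by genre and register). The function names are illustrative and are not part of any Lextutor tool.

```python
import math

def expected_hits(per_million: float, corpus_words: int) -> float:
    """Expected occurrences of an item, given its rate per million words."""
    return per_million * corpus_words / 1_000_000

def words_needed(per_million: float, min_hits: int) -> int:
    """Smallest corpus size (in words) expected to yield at least min_hits."""
    return math.ceil(min_hits * 1_000_000 / per_million)

# Slide figure: "raining cats and dogs" gives 5 hits in 1 billion words,
# which is a rate of 5 per 1,000 million words.
idiom_rate = 5 / 1_000  # hits per million words

# Expected hits for the idiom in a 100-million-word corpus (BNC-sized).
print(expected_hits(idiom_rate, 100_000_000))

# Corpus size needed to see a 10-per-million word 10 times.
print(words_needed(10.0, 10))
```

Run the other way round, the same arithmetic shows why a rare idiom needs a corpus far larger than the 1 million words that suffice for basic vocabulary.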


A growth industry

Brown 1970 .......... 1,000,000 wds (http://icame.uib.no/brown/bcm.html)

BNC 1994 .......... 100,000,000 wds (www.natcorp.ox.ac.uk)

Cambridge Int’l 2002 .......... 1,000,000,000 wds (www.cambridge.org./elt/corpus/international_corpus.htm)

Plus ANC, Bank of English, Cancode …


Design / composition e.g., Brown (1970s)

Page from Lextutor


What does a corpus represent?

A language as a whole
• BNC

Or a part
• Cancode oral, MICASE academic

Or an individual
• Jack London’s collected works

Or a group of individuals
• Class of ESL learners


How do we read a corpus?

Cannot read it naturally
– Defeats the goal

Needs the help of a search technology
- concordance
- index
- frequency list
- many others


Concordancers

http://www.lextutor.ca/concordancers/concord_e.html
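
What a concordancer does can be sketched in a few lines: find every occurrence of a keyword and print it with a fixed window of context on each side, keywords aligned in a column. This is a toy illustration in Python, not the Lextutor implementation. The function name `kwic` and the 30-character window are assumptions made for the example.

```python
import re

def kwic(text: str, keyword: str, width: int = 30) -> list[str]:
    """Return one keyword-in-context line per hit, keyword upper-cased."""
    text = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}}{match.group().upper()}{right}")
    return lines

sample = ("His love for all things maritime borders on the obsessional. "
          "Turkey borders on Iraq. Her mastery borders on perfection.")
for line in kwic(sample, "borders"):
    print(line)
```

Sorting the lines by the word to the right of the keyword is one common next step, since it groups recurring patterns together.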


Lists

http://www.lextutor.ca/freq/compleat_lister/
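
A frequency list is the simplest corpus tool to build by hand: split the text into word tokens, count them, and sort. A minimal Python sketch follows. The tokenising rule (lower-cased runs of a-z, with an optional apostrophe part) is an assumption chosen for brevity and would need extending for accented characters or other languages.

```python
import re
from collections import Counter

def frequency_list(text: str) -> list[tuple[str, int]]:
    """Return (word, count) pairs, most frequent first, ties alphabetical."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

sample = "The man and the woman saw the dog. The dog saw the man."
for word, count in frequency_list(sample)[:5]:
    print(f"{count:>4}  {word}")
```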


Indexes

http://www.lextutor.ca/concordancers/text_concord/
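
An index, in the general sense, maps every word form to the places where it occurs, so that any word in a text can be looked up without rereading the text. The Python sketch below records token positions. It illustrates the idea only and makes no claim about how the Lextutor tool at the URL above is built; `build_index` is a name invented for this example.

```python
import re
from collections import defaultdict

def build_index(text: str) -> dict[str, list[int]]:
    """Map each word to the list of token positions where it occurs."""
    index = defaultdict(list)
    for position, token in enumerate(re.findall(r"[a-z']+", text.lower())):
        index[token].append(position)
    return dict(index)

index = build_index("The rain in Spain stays mainly in the plain")
print(index["the"])
print(index["in"])
```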


Corpora – why do we need them?


Why do we need corpora?

A. Corpus work is sexy

B. We have computers – let’s use them

C. Linguistic intuitions are unreliable


Linguistic intuitions are notoriously unreliable

Demo 1: Do you think however is more common in spoken or in written language?

By how much? (3 to 1… etc)

http://www.lextutor.ca/range/range_corpus/


Demo 2: What are the main senses of back and which is most common?

• By what factor?

http://www.lextutor.ca/concordancers/concord_e.html


Demo 3: Can you rank order these roughly by frequency band?

0 - 2k, 3k - 5k, 6k - 10k, 11k - 15k

http://www.lextutor.ca/freq/train/

Try one? http://www.lextutor.ca/freq/train/


But not always

Demo 4: Which do you think is more common, man and woman,

or woman and man?

Factor of 10:1, 5:1, 2:1?

Go Live http://www.lextutor.ca/concordancers/concord_e.html
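
Away from the live concordancer, the same check can be run on any text file by counting each word order and comparing. A small Python sketch: the toy text below is invented for illustration and says nothing about the real ratio, which has to come from an actual corpus.

```python
import re

def count_phrase(text: str, phrase: str) -> int:
    """Count whole-word, case-insensitive occurrences of a phrase."""
    words = [re.escape(word) for word in phrase.split()]
    pattern = r"\b" + r"\s+".join(words) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Toy text standing in for a real corpus (invented for illustration).
toy_corpus = ("Every man and woman in the hall stood up. A woman and man "
              "at the back stayed seated. Man and woman alike applauded.")
first = count_phrase(toy_corpus, "man and woman")
second = count_phrase(toy_corpus, "woman and man")
print(f"man and woman: {first}, woman and man: {second}")
```

The word-boundary markers matter here: without them, "woman and man" would also be counted as containing "man and man".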


Many linguistic intuitions are unreliable

Implicit patterns are extremely slow to extract from input

N. Ellis, J. Hulstijn

… because of the severe limitations on what we can see and remember

… unaided


Scientific instrumentation - a brief history


Not only linguistic intuitions are problematic

For every appearance, many possible explanations

Stand outside on a starry evening, what does it look like?


The role of the computer in modern science is well known. In disciplines like physics and biology, the computer's ability to store and process inhumanly large amounts of information has disclosed patterns and regularities in nature beyond the limits of normal human experience. Similarly in language study, computer analysis of large texts reveals facts about language that are not limited to what people can experience, remember, or intuit. In the natural sciences, however, the computer merely continues the extension of the human sensorium that began 200 years ago with the telescope and microscope. But language study did not have its telescope or microscope. The computer is its first analytical tool, making feasible for the first time a truly empirical science of language.

– Cobb 1999


Before the computer, linguists could only study small samples of language at a time because of their limitations of their powers of observation and their memories. Even scholars who relentlessly collected instances of usage all their lives only had a few examples of any particular pattern, and there was no way of telling what they had missed.

Sinclair, 2003, p. ix


Early corpora

Dr Johnson, A Dictionary of the English Language (Longman, 1755)

Based on quotations from literature, copied onto many slips of paper

But using literature has some problems
- Old and recent lit conflated
- Is literature truly representative of life’s typical situations?
- Is its lexis “a bit rarefied”?


120 years later - James Murray, OED 1879
- REAL LANGUAGE examples sent in by post
- Oxford City Post Office sets up a special sub-branch for OED


Most sciences - supplemented by technologies from 15th century

BIOLOGY .......... microscope
ASTRONOMY .......... telescope
NAVIGATION .......... astrolabe
etc

Language study – late 20th century –

….machine readable corpora


Thus the “corpus revolution”

Dictionaries Grammars Courses Studies


Of particular note…

LGSWE


Corpus – successes


Fabled Core of English is close to disclosure

Main lexis + coverage 2000 wd families = 80%, Carrol et al 76

Main collocations in BNC-speech 84 HF collocations belong in 1k list, Shin & Nation 2007

Main phrasal verbs – 25 Ph vbs = 1/3 of all ph vbs in BNC, Gardner & Davies, 2007

Main morphologies Bauer & Nation, 1993

Main stress patterns (Murphy & Kandil)

Cf. All this coming together at the same time as the human genome, also a corpus project
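
A coverage figure such as "2000 wd families = 80%" is the share of running words in a text that belong to a given word list. A minimal Python sketch of that calculation follows. The ten-word list here is hypothetical and far too small to be meaningful; a real study would use a 2,000-family list with inflected and derived forms grouped under each headword.

```python
import re

def coverage(text: str, word_list: set[str]) -> float:
    """Share of running words (tokens) in text that appear in word_list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in word_list)
    return known / len(tokens)

# Hypothetical mini word list, for illustration only.
high_frequency = {"the", "a", "of", "and", "is", "in", "on", "it", "dog", "house"}
text = "The dog is in the house and the albatross is on it"
print(f"{coverage(text, high_frequency):.0%}")
```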


Ancient prescriptivism is close to defeated in language pedagogy

Except one debate remains: corpus-based v. corpus-informed approaches

Corpus based
- If it’s in the corpus X times, it’s OK
- X to be defined

Corpus informed
- Corpus information is one source of information


Numerous errors are now corrected (in principle)

- Definitions no longer harder than the defined word
- Simple present no longer automatically the first verb tense taught
- Written language no longer the model for spoken language
- Status of multi-word units reinstated
- Grammar no longer taught …
  - via unknown lexis
  - as unconnected to lexis


Task

Grammar as connected to lexis? Let’s see what this could mean

+ practice “reading concordances”

Get out “borders on”
• (From Sinclair http://www.twc.it/)

What is the pattern? What does it mean?

Can we call this “word grammar”?


041. cember, Karimov became is more than just a way of life – it BORDERS on a religion. But there is of the laws of the sea s
042. n a religion. But there is of the laws of the sea sometimes BORDERS on arrogance. Not only should the international coll
043. ot only should the international collaboration is great and BORDERS on cartel like behaviour. who say using the extremis
044. on cartel like behaviour. who say using the extremist label BORDERS on demagoguery and will only serve Yugoslavia. What
045. ery and will only serve Yugoslavia. What is occurring there BORDERS on genocide. No country or society Careless but losi
046. o country or society Careless but losing two in the one day BORDERS on incompetence. Now Charlie Turkey, the only NATO c
047. competence. Now Charlie Turkey, the only NATO country which BORDERS on Iraq, is playing a key role in Her mastery of the
048. aq, is playing a key role in Her mastery of the short story BORDERS on perfection. kate saunders country’s stagnant grow
049. fection. kate saunders country’s stagnant growth, which now BORDERS on recession. Here again, the challenge looms ugly w
050. ession. Here again, the challenge looms ugly when recession BORDERS on slump. Everybody is on edge, The author, a lifelo
051. incredible. In the case_0 of maxim ‘The collector’s passion BORDERS on the chaos of memories.’ before staged protests at
052. he paranoid and, although and an easy going demeanour which BORDERS on the charismatic, it’s hardly popular music. In so
053. ian province of Kosovo, a professional solicitousness which BORDERS on the dangerous edge of savings accounts versus sha
054. e Soviet Central Asian clash. He said: ‘The hostility there BORDERS on the dangerous.’ Black players and – and to perfor
055. pathological. The sky, a then Claire makes a statement that BORDERS on the downright cocky. When I ask The linear intens
056. the chaos of memories.’ before staged protests at these two BORDERS on the east and west of their speaking to troops in
057. e obsessive. But there is the Sierra Madre” as he dubs them BORDERS on the eccentric. Mountain lions courses and opportu
058. ccentric. Mountain lions courses and opportunities, that it BORDERS on the embarrassing. This the straight, but his winn
059. on the obsessive. He portrays has a streak of bravery which BORDERS on the foolish. She has delicate to buy. A family wi
060. sensational because the amount of work he is required to do BORDERS on the incredible. In the case_0 of maxim ‘The colle
061. rs on the dangerous edge of savings accounts versus shares, BORDERS on the irresponsible. an independent Bosnia in its p
062. the contrary, his private His love for all things maritime BORDERS on the obsessional. He is truly Not surprisingly, th
063. ally acceptable, four even_0 harbour a passion for DIY that BORDERS on the obsessive. But there is the Sierra Madre” as
064. on slump. Everybody is on edge, The author, a lifelong fan, BORDERS on the obsessive. He portrays has a streak of braver
065. right cocky. When I ask The linear intensity of their songs BORDERS on the paranoid and, although and an easy going deme
066. on the surreal. Wander into the The atmosphere of paranoia BORDERS on the pathological. The sky, a then Claire makes a
067. the embarrassing. This the straight, but his winning effort BORDERS on the sensational because the amount of work he is
068. surreal. He had his own most dangerous regions on Earth. It BORDERS on the Serbian province of Kosovo, a professional so
069. lish. She has delicate to buy. A family with three children BORDERS on the socially acceptable, four even_0 harbour a pa
070. east and west of their speaking to troops in Xinjian which BORDERS on the Soviet Central Asian clash. He said: ‘The hos
071. gerous.’ Black players and – and to performing them sort of BORDERS on the surreal. He had his own most dangerous region
072. e obsessional. He is truly Not surprisingly, the atmosphere BORDERS on the surreal. Wander into the The atmosphere of pa
073. arismatic, it’s hardly popular music. In some cases_1, this BORDERS on wholesale plagiarism. That’s * __________________
074. on the irresponsible. an independent Bosnia in its pre war BORDERS. On the contrary, his private His love for all thing
075. ________________________ and on mutual respect for existing BORDERS” on December, Karimov became is more than just a way
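
One way to start answering "What is the pattern?" is to tally the first word to the right of the node phrase. The Python sketch below does this for a handful of right-hand contexts copied from the concordance above. The function name `right_collocates` is invented for the example, and the tally is only a first step: deciding what the pattern means still takes reading the lines.

```python
import re
from collections import Counter

def right_collocates(lines: list[str], node: str = "BORDERS on") -> Counter:
    """Count the first word to the right of the node phrase in each line."""
    pattern = re.compile(re.escape(node) + r"\s+(\w+)")
    counts = Counter()
    for line in lines:
        match = pattern.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts

# A few contexts copied from the concordance lines above.
lines = [
    "there BORDERS on genocide.",
    "one day BORDERS on incompetence.",
    "which BORDERS on Iraq,",
    "maritime BORDERS on the obsessional.",
    "paranoia BORDERS on the pathological.",
    "effort BORDERS on the sensational",
]
counts = right_collocates(lines)
print(counts.most_common())
```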


Corpus – failures


And yet…

“The corpus-driven revolution in applied linguistics continues apace, and along with it the paradox that as corpora change the face of applied linguistics (most dictionaries, grammars, and course books now claim to be corpus based) it is largely without the participation of practitioners. Only a few teachers or researchers have ever built a corpus or delved through concordance lines.”

- Cobb 2008, review of CBLS


Stalled enterprise (McCarthy, 2008)

Teachers and researchers need to become producers, not just consumers, of corpus research

Why?
- To evaluate “corpus based” claims
  - Often vocab but not grammar is CB, etc
  - What kind of corpus?
- To effectively lobby to get their CB needs met
  - e.g. Gram+lex of specific domains
- To develop their own CB materials
  - Who still uses a course book?
- To build their own corpora for action research projects


Stumbling blocks

Some intimidation remains attached to corpus work
- It is not universally appreciated in SLA - Widdowson
- Computer stuff looks daunting
- Seems more linguistics than applied

POLICY OF THIS WORKSHOP:

There are some fairly clear reasons to do this and simple ways to get started


… The classic corpora are not easy-access

- Despite long lists on the Web
- Even McCarthy’s Cancode is 100% unavailable to researchers
  - Ref Tribble review of O’Keeffe et al
- Especially in languages other than English
  - Lextutor users’ requests for German =>

Solutions
<= [1] Band together (CECL)
[2] Make your own =>


DIY corpus – why?


German http://www.lextutor.ca/concordancers/braun_info.html


Why bother – Google is a corpus

Ref – Robb


Classic case, breadth v. depth

Web-as-corpus gives massive volume

Even smallish DIY corpus gives
- Better quality search
  - Families, starts with, ends with
- Easier access to detail & context
- Better exposure to pattern

+ you can make your own, target your own needs
- Material for learners
- Material from learners

v. corpus


DIY corpus – how?


Build your own - HOW

Many texts on the Web
- E.g., http://www.lextutor.ca/bookbox/
- Question of selection replaces question of access

Must be or become text files (whatever.txt, “dot txt”)

Whether you want a one-big-file corpus
Or several-small-files corpus
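
Moving from a several-small-files corpus to a one-big-file corpus is a matter of concatenating the .txt files. A minimal Python sketch, assuming the files are already plain text in UTF-8; the function name and the file names in the demo are hypothetical, and the demo runs in a temporary folder so that nothing on disk is touched.

```python
import tempfile
from pathlib import Path

def build_one_big_file(source_dir: str, output_file: str) -> int:
    """Concatenate every .txt file in source_dir into output_file.

    Returns the number of files merged.
    """
    paths = sorted(Path(source_dir).glob("*.txt"))
    out_path = Path(output_file)
    merged = 0
    with out_path.open("w", encoding="utf-8") as out:
        for path in paths:
            if path.resolve() == out_path.resolve():
                continue  # skip the output file if it sits in the same folder
            out.write(path.read_text(encoding="utf-8").strip() + "\n\n")
            merged += 1
    return merged

# Demo in a throwaway folder with two invented texts.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "story1.txt").write_text("First text.", encoding="utf-8")
    Path(folder, "story2.txt").write_text("Second text.", encoding="utf-8")
    corpus_path = Path(folder, "corpus_all.txt")
    merged_count = build_one_big_file(folder, str(corpus_path))
    corpus_text = corpus_path.read_text(encoding="utf-8")
print(merged_count, "files merged")
```

Keeping the small files as well costs nothing, and leaves both options open: the big file for overall counts, the separate files for per-text comparisons.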


Only plain .TXT files make corpora


One big file: a) Insert


One big file: b) Upload http://www.lextutor.ca/tools/corpus_builder2/


DIY corpus for learning materials


Using CB tools to select / develop learning materials?

Using news texts?
- Check first against CB frequency lists

Pre-teaching vocab?
- Find the CB keywords

Writing tests?
- Check it contains gram+lex the students have actually seen

Teaching a speaking course?
- Check models are speech not writing


Build corpus as learning materials

For some purpose

Must make some sampling sense

EG one London – all London

All course materials

Corpus of graded readers


Learning materials – multi-file corpus
http://www.lextutor.ca/callwild


Learning materials – one-file corpus
http://conc.lextutor.ca/list_learn/eng/


Learning materials – one-file corpus
http://www.lextutor.ca/corpus_grammar/


DIY for research purposes


1. Written production


Learner text more and more available

- Collect & investigate because it is there?

Some typical purposes

- determine needs

- check progress

- Cf. active vs. passive ability

- explore for experimental hypothesis

Constraints

Choose topic carefully
- Does topic suggest just one verb tense?
- Cf. capital punishment vs. my holiday
- Very different language demands


Models of LCs

Learners vs. NSs

Ls vs. Ls –
- Snapshot or longitudinal (same Ls at diff times)
- Or diff Ls at diff stages in learning ≅ longitudinal (cross-sectional)

OR

Belz (04, citing Cobb 03): 4 LC variables should be controlled:
1. type of learner (e.g., FL vs. SL)
2. stage of learner
3. text type/purpose/register/conditions
4. and the availability of a similar corpus of native speaker data


NS data must be comparable

Best example is UCLE’s Locness

Louvain Corpus of Native Speaker Essays
- 149,574 words of argumentative essays written by American university students
- 18,826 words of literary-mixed essays written by American university students
- 59,568 words of argumentative and literary essays written by British university students
- 60,209 words of British A-level argumentative essays
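
Comparable also means comparable in counting: a learner corpus and an NS corpus are rarely the same size, so raw counts need normalising before they are set side by side. A short Python sketch of per-10,000-word rates follows. The NS corpus size is the American argumentative figure from the list above; the learner corpus size and both raw counts are invented purely to show the arithmetic.

```python
def per_10k(count: int, corpus_words: int) -> float:
    """Normalise a raw count to occurrences per 10,000 words."""
    return count * 10_000 / corpus_words

ns_words = 149_574       # from the slide: American argumentative essays
learner_words = 50_000   # invented learner corpus size
ns_rate = per_10k(120, ns_words)            # 120 is an invented raw count
learner_rate = per_10k(90, learner_words)   # 90 is an invented raw count
print(f"NS: {ns_rate:.1f} per 10k, learners: {learner_rate:.1f} per 10k")
```

In this made-up case the learners have the lower raw count but the higher rate, which is exactly the kind of reversal that unnormalised counts hide.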


Issues in LC

SMALL ISSUES –

Tag or not?

Spell check or not, or at what point?

One file or many?

BIG ISSUE - Granger 2004, p. 124

What kind of data is a LC?

“LC typically fall into the category of natural or open-ended data” while “SLA researchers tend to prefer [1] introspective or [2] experimental/elicited data…”

V BIG ISSUE - Is this paradigm an instance of Bley-Vroman’s (1983) “comparative fallacy”?


Once made, flat or tagged?

Pros of flat corpus
- If for learning materials, = what learners face
  • THEY must make sense of data
  • Tagged does it for them
- Easier to make, you can have more
- Search inputs require some work, trial + error

Pros of tagged corpus
- Precise comparisons are possible
  • Especially for N-N compounds and errors

But learner data poses special problems
- Tags are needed for error analysis
  • VP + ADV + D OBJ, etc.
- Yet learner data confuses taggers
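To make the N-N compound point concrete, here is a minimal sketch of what a tagged corpus allows. It assumes the text has already been tagged in a `word_TAG` format with Penn-style noun tags beginning with `NN`. Both the format and the sample sentence are assumptions for illustration, not the output of any particular tagger named in this talk.

```python
def noun_noun_pairs(tagged_text):
    """Return adjacent (noun, noun) pairs from text in word_TAG format."""
    # Keep only tokens that carry a tag, then split each into (word, tag).
    tokens = [tok.rsplit("_", 1) for tok in tagged_text.split() if "_" in tok]
    pairs = []
    for (w1, t1), (w2, t2) in zip(tokens, tokens[1:]):
        # Assumption: every noun tag starts with "NN" (NN, NNS, ...).
        if t1.startswith("NN") and t2.startswith("NN"):
            pairs.append((w1.lower(), w2.lower()))
    return pairs


# Hypothetical hand-tagged sample sentence.
sample = "The_DT computer_NN programmers_NNS read_VBP books_NNS at_IN home_NN"
print(noun_noun_pairs(sample))
```

With a flat corpus the same search has to be approximated with word lists or wildcards, which is the trial and error mentioned above.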

72

Error tagger (UCL Err Extractor – Granger 02)

Specific-purpose, known-target tagging
- Unlikely to confuse tagger, but a ton of work

73

Here’s a set of studies I’m working on

LC study typically begins with a practical problem

Theoretical conundrums? Not so much

E.g., this problem:

Montreal learners

Eight years ESL

At 18 many switch to English-language system

With insufficient vocabulary for advanced study in English

Fully competent only at 1k

74

Big question

Input: What lexis are these kids getting in school?

RQ

Do their NNS teachers have enough vocab themselves to get kids over the 1k-hump?

75

Procedure

Run Vocab size test on Ts
- Nation’s new 14k – lextutor.ca/tests/

Get small exploration corpus of their production
- “How could the TESL program be improved?”
- Argumentative + opinion

Get similar sized NS corpus
- LOCNESS, A-Levels, UK
- “An invention that has changed how we live”

Compare for structure and lexis
- Quantity (frequency) and quality
- Focus on lexis 2k+
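The "compare for lexis" step can be sketched as a frequency-band profile. This is a minimal illustration only: the `BANDS` lists below are tiny hypothetical stand-ins, and the sketch matches surface word forms, whereas a real profiler works from full frequency lists grouped into word families.

```python
import re
from collections import Counter

# Hypothetical toy band lists (assumption); real lists hold about 1,000 families per band.
BANDS = {
    "1k": {"the", "people", "should", "read", "books", "and", "live", "more"},
    "2k": {"imagination", "adventure"},
}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def profile(text, bands=BANDS):
    """Count tokens per frequency band; anything unlisted is 'off-list'."""
    counts = Counter()
    for token in tokenize(text):
        band = next((name for name, words in bands.items() if token in words), "off-list")
        counts[band] += 1
    return counts


sample = "People should read books and live more to regain imagination and adventure."
result = profile(sample)
for band, n in result.items():
    print(band, n)
```

Running the same function over the NNS and NS corpora gives two profiles that can be compared band by band.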

77

Prelude

Look at TESLProg.txt in your handout as demo mini-corpus

Writing task was this:
- 5-minute in-class writing exercise
  - Peter Elbow, keep writing idea
- Discursive topic
  - How could UQAM new TESL program be improved?
- Homework:
  - identify your main point
  - focus + elaborate for Web publication
- Each paper gets three rounds of feedback

Page 78: Learner corpus research - hands on Tom Cobb Didactique des langues / éducation Université du Québec à Montréal Saturday, October 31 8:15am - 10:15am lextutor.ca/cv/slrf_09/corpus.ppt.

78

Computers have become a huge part of our lives in both the areas of work and education. But are they such a good thing? When calculators came along a drop in ability of students for mental arithmetic was obvious and now they are used for the simplest calculations. The computer could do the same thing. Computers encourage laziness in the general public, why work out something yourself when the computer can do it for you. This is very time saving and efficient but it is causing people to forget basic ideas. For instance, spelling is no longer as important as it was you can simply use a "spellcheck" to correct your English, which is absurd. For the youth of today computers offer links around the world and millions of facts and figures. This could be argued to be educational. However, this is killing the imagination of children and they spend hours sat at a keyboard tapping away in the doom and gloom of the house. They should be out enjoying themselves and gaining experiences for themselves instead of reading about them on a flat screen. It is said that you can meet people through computers and have `relationships'. I find this preposterous and people are losing the ability to communicate and form relationships.

Comparison text from Locness (ex 1)

79

Computers may be the future but what part will man have in this future. There will be no need for people to go to school as they could be taught at home, people would hardly ever talk and the only career available would be for computer programmers. I agree that computers are helpful but people should not live through their computers and be so reliant on them. They should read books and live more in order to regain their lost imagination and sense of adventure. Also, in schools I feel that work should be done mainly by hand and calculators and computers should only be used minimally in mathematics in order to stop the production of computer addicts and again have normal people.

Comparison corpus from Locness (2)

More lexis? Less? A little? A lot?

http://www.lextutor.ca/vp/bnc/

80

Which analysis software?

81

Basic structure snapshot (Qc corpus)

http://www.lextutor.ca/concordancers/text_concord

82

http://www.lextutor.ca/concordancers/text_concord

84

http://www.lextutor.ca/tuples/eng/

85

Lexis comparison

86

Lexis comparison

NNS corpus (Quebec TESL trainees)
- 155 post-1k word families / 3356 tokens

NS corpus (UK A-Levels essay)
- 269 post-1k word families / 3630 tokens

But that’s not all
- Split up corpus
- Look at individuals

http://www.lextutor.ca/vp/bnc/
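Because the two corpora differ slightly in size, the counts above are easier to compare once normalized. The sketch below converts each figure to post-1k families per 1,000 tokens. The function name is hypothetical, and the normalization is only a rough guide, since family counts do not grow linearly with text length.

```python
def families_per_1000(families, tokens):
    """Post-1k word families per 1,000 running words."""
    return 1000 * families / tokens


# Figures taken from the slide above.
nns = families_per_1000(155, 3356)   # Quebec TESL trainees
ns = families_per_1000(269, 3630)    # UK A-level essays

print(f"NNS: {nns:.1f} per 1,000 tokens")
print(f"NS:  {ns:.1f} per 1,000 tokens")
print(f"NS/NNS ratio: {ns / nns:.2f}")
```

On these figures the NS writers use roughly 1.6 times as many post-1k families per 1,000 tokens as the trainees.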

88

Almost all post-2ks are used by one writer only

89

Conclusion

Interesting peripheral differences for another study
- Syntax correct but unelaborated
- Phrases heavy on the short end, light on the long end
- Low proportion of noun-noun

Vocab
- Heavy reliance on 1k vocab
- Low post-1k
- Items used by one person

Yet good recognition scores at 3k+ levels
- Known words are not getting used
- Unlikely to get used in classroom

90

2. Oral production corpus

91

Let’s learn more about the previous study:

Follow trainees into their classrooms

Does the predicted pattern occur?

If new words appear, are they recycled?

*See Horst’s Teacher Talk Corpus study in a forthcoming RIFL (2011)

(Note: Different subjects – here we are establishing tools & method)

92

Looks like rich lexical input…

18 hrs of NS-T classroom talk

93

Summary

Post-1k words (learning zone)
- 1570 families
- 900 appear in one class-hour only
  - Inc. 300 one TIME only
- «Recyclage» is not happening

Now add this to the NNS data
- Few post-1k used in own writing
- The problem starts to make sense
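The recycling counts in this summary can be reproduced in outline as follows. This is a sketch under stated assumptions: each class-hour is already a list of post-1k tokens, word forms stand in for word families, and the toy `hours` data is invented for illustration.

```python
from collections import Counter


def recycling_summary(sessions):
    """sessions: list of token lists, one list per class-hour."""
    total = Counter()          # occurrences of each word across all sessions
    session_range = Counter()  # number of sessions each word appears in
    for tokens in sessions:
        total.update(tokens)
        session_range.update(set(tokens))
    one_session = {w for w, n in session_range.items() if n == 1}
    once_only = {w for w in one_session if total[w] == 1}
    return {
        "types": len(total),
        "one_session_only": len(one_session),
        "once_only": len(once_only),
    }


# Hypothetical toy data: three class-hours of post-1k tokens.
hours = [
    ["volcano", "erupt", "volcano", "lava"],
    ["lava", "glacier"],
    ["erupt", "harvest", "harvest"],
]
summary = recycling_summary(hours)
print(summary)
```

A word that appears in only one session, however often, is never recycled across lessons. A word that occurs once in total is not recycled even within a lesson.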

94

Or, Alert’s 108,000 wds, no past tense!

Went, saw

http://www.lextutor.ca/concordancers/concord_e.html
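Checking a corpus for forms such as went and saw is a keyword-in-context (KWIC) search. The sketch below shows the idea on an invented sample. It is not the concordancer at the URL above, and the function name and `width` parameter are hypothetical.

```python
import re


def kwic(text, keywords, width=30):
    """Return one context line per whole-word match of any keyword."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, keywords)) + r")\b", re.IGNORECASE
    )
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines


# Invented sample text (assumption), not taken from any corpus in this talk.
sample = "Yesterday we went to the park. I see a dog and then I saw a cat."
hits = kwic(sample, ["went", "saw"])
for line in hits:
    print(line)
```

An empty result for common irregular past forms across a whole corpus is the kind of finding the slide reports.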

95

3. Goal clarification

96

Let’s work through a published study

Ovtcharov & Cobb 2006 (en français)

Situation: Ottawa

Civil service promotions depend on success in L2 oral interview

Pass/fail evaluated globally (=impressionistically)

“A well developed vocabulary” is one of the stated criteria

But what is it?

The usual soft focus

97

Needed for the study

1. Corpus of transcribed oral interviews
   - Both passes, fails, & borderlines
   - 24 of each, 25-35 minutes
   - 100s of hours work

2. French version of Vocabprofile
   - Lemmatized large-corpus based, k-leveled frequency lists?
   - Miraculously appear in c. 2001
   - See Cobb & Horst, 2004

3. Usable NS reference corpus
   - Provided by Beeching, 2001
   - French oral interviews in USA

98

Result

Identifiable difference at 2k

Strong difference at 3k+MHL (off-list)

99

Significance

(Assuming replication)

One less failure-to-communicate in the vastness of high-stakes language instruction

The instructional design process has a place to begin

100

Corpus research is a fairly simple, bean-counting type of research

That can solve complex problems in language learning & teaching, both

Practical
- What do these people need to learn?
- Can examiners’ impressions be operationalized?

Theoretical
- E.g., piecing together the portrait of advanced interlanguage (Cobb 2003)

So…

101

Course tie-up

102

At 10.15 you now know…
- What a corpus is
- Why it is important
- What insights it has yielded in applied linguistics
- The uses it can have for researchers … for instructors
- How to build a corpus
- Choice points in building a corpus
- Some tools of corpus analysis
- How to do a learner corpus study
- The results of some published learner corpus studies
- The future of learner corpus studies

103

The Future

104

Corpus research carries on shining the light into dark corners
- 2007-2009 work from Dee Gardner, Stuart Webb

Some increase in corpus awareness
- Teacher training programs
- MA methods courses

Collaboration reduces labour
- CECL, the Locness reference corpus
- Promise of automatic corpus comparisons at Calper Gold

Dev. world can play as tools go online

Where do we go from here?

105

If we have time…

The final challenge to the utility of frequency lists

As already seen
- We are closing in on the Core of English
  - This includes a smaller than expected group of true homonyms
- No corpus tool-kit so far deals with these systematically
  - E.g., a Vocabprofile analysis does not distinguish bank and bank

106

Go live

http://www.lextutor.ca/concordancers/text_concord

www.lextutor.ca

This PPT at http://www.lextutor.ca/cv/slrf_09/corpus.ppt

References list at http://www.lextutor.ca/cv/slrf_09/handout.doc