Dyslexia Guild Conference 2013 - Online Corpus

Post on 21-Jan-2015

Online Corpus: a structured set of texts where information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags.

Dominik Lukes

Transcript of Dyslexia Guild Conference 2013 - Online Corpus

dyslexiaaction.org.uk

Online Corpus: Literacy Teachers’ Best Friend

Dominik Lukeš http://dominiklukes.net

Dyslexia Guild Summer Conference 2013

Outline

http://www.flickr.com/photos/adactio/3563832656

What is a corpus

Answering questions with a corpus

The language of corpus searches

The corpus and the classroom

Practice

Corpus / Corpora

????

Knowledge of / about language

http://www.flickr.com/photos/missturner/3029700617/

Prescriptivism … how language should be used

v

Descriptivism … how language is used

“Most of the prescriptive rules of the language mavens make no sense on any level. They are bits of folklore that originated for screwball reasons several hundred years ago… For as long as they have existed, speakers have flouted them…”

“intellectual abdication”

“should be ashamed”

“current around 1900”

“a perversion of grammatical education”

“blind to textual evidence even when he himself exhibits it”

“dishonest and stupid”

“vile little compendium of tripe about style”

Grammarian

Geoffrey K Pullum on …

“More passives in Orwell's pompous essay with the warning about how you mustn't use them than in any periodical you can lay your hands on!”

This usage stuff is not straightforward and easy. If ever someone tells you that the rules of English grammar are simple and logical and you should just learn them and obey them, walk away, because you're getting advice from a fool.

http://languagelog.ldc.upenn.edu/nll/?p=2790

Corpus

Key modern tool for finding out about how language works…

… is a large database of representative language samples …

… 100s of millions of words from (mostly) written language in different genres in small samples (~2000 words) …

… used for linguistic research, making dictionaries, writing grammars, …

Corpora available for teachers

http://corpus.byu.edu

Access to COCA and related BYU corpora is free…

…but free registration is required for more than ~10 queries a day

Brown – the grandfather

COCA

BNC

Webcorp

Google

http://www.flickr.com/photos/atoach/3900591006/

Searching a corpus early on in the process of making a generalization can save you a lot of unpleasant surprises later.

How do we use the word dyslexia?

We speak more often of dyslexic children than adults.

We speak more often of dyslexia than any other dys- word.

Concordance

BNC: dyslexic [n*] (http://corpus.byu.edu/bnc)

COCA: dyslexic [n*] (http://www.americancorpus.org/)

COCA: dys*

Suffixing rules

*yed: played, stayed, portrayed, enjoyed, unemployed, surveyed

*ied: died, tried, married, worried, identified, applied
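The two result lists above can be checked against the usual spelling generalization (a vowel before the y keeps it, a consonant before it gives -ied). The generalization itself is not stated on the slide, and the helper names below are made up for illustration; the word lists are copied from the search results. A minimal sketch in Python:

```python
VOWELS = set("aeiou")

# Word lists copied from the *yed and *ied search results on the slide.
YED_WORDS = ["played", "stayed", "portrayed", "enjoyed", "unemployed", "surveyed"]
IED_WORDS = ["died", "tried", "married", "worried", "identified", "applied"]


def letter_before(word: str, ending: str) -> str:
    """Return the letter that comes immediately before the given ending."""
    if not word.endswith(ending) or len(word) <= len(ending):
        raise ValueError(f"{word!r} does not end in {ending!r} with a letter before it")
    return word[-len(ending) - 1]


def letters_before(words: list[str], ending: str) -> list[str]:
    """Collect the distinct letters found before the ending, sorted."""
    return sorted({letter_before(w, ending) for w in words})


yed_letters = letters_before(YED_WORDS, "yed")
ied_letters = letters_before(IED_WORDS, "ied")

print("before -yed:", yed_letters)  # all vowels in this sample
print("before -ied:", ied_letters)  # all consonants in this sample
```

A sample of twelve words proves nothing on its own; the point is that a corpus search returns enough real examples to test a rule, and to find its exceptions, before teaching it.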

The Corpus Magic

*: any number of characters (incl. 0)

?: any one character

[ ]: lemma (all inflectional forms of a word)

[n*]: part of speech tags (e.g. nouns)

Different corpora use slightly different codes. Read the manual.

*each: each, reach, beach, teach, outreach, …, impeach, …

teach*: teachers, teaching, …, teachable, teacher-librarians, …

t*ch: touch, teach, tech, torch, trench, twitch, …, three-inch, …

teach *: teach the, teach us, teach students, …

?each: reach, beach, teach, peach, leach, keach, …

each?: each- (1), each# (1) [i.e. nothing]

?each?: peachy, bleachy, teacha, reachs (2) [i.e. spelling error], …

t?ch: tech, tach, toch, tuch, tsch, tich

t??ch: touch, teach, torch, tisch, …
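For readers who want to try the wildcard examples above on their own word lists, the two symbols can be translated into regular expressions. This is only a sketch of the idea in Python, not how any corpus interface is implemented; the function names and the word list are assumptions made for illustration.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a corpus-style wildcard pattern into a compiled regex.

    '*' stands for any number of characters (including none),
    '?' stands for exactly one character; everything else is literal.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def search(pattern: str, words: list[str]) -> list[str]:
    """Return the words that match the whole wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    return [w for w in words if regex.fullmatch(w)]


# Hypothetical word list standing in for a corpus vocabulary.
WORDS = ["each", "reach", "beach", "teach", "outreach", "teachers",
         "teaching", "touch", "tech", "torch", "peachy"]

for query in ["*each", "teach*", "t?ch", "t??ch", "?each?"]:
    print(query, "->", search(query, WORDS))
```

Note that `?` must match exactly one character, which is why `t?ch` finds tech but not teach, while `*` may match nothing at all, which is why `*each` finds each itself.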

[Lemma]

Part of speech tags

[run].[n*]: the lemma run tagged as a noun

[run] [n*]: the lemma run followed by any noun

Common tags

[n*]: noun

[NN2]: plural nouns

[v*]: verb

[VVD]: verb past tense

[aj*] (BNC) / [j*] (COCA): adjective

[av*] (BNC) / [r*] (COCA): adverb

Help

You can also

cats and dogs: search for idioms

?each*s: combine wildcards

[=pretty]: search for synonyms

car|bike|horse: search for alternatives

used -car: exclude searches

For more details see:

Concordance + KWIC

*ies.[N*]

KWIC – Key-Word In Context

*ies.[N*]
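A key-word-in-context display simply lines up every hit with a few words of its left and right context. The sketch below shows the idea in Python on an invented sentence; the function name, the sample text, and the three-word window are all assumptions, and a plain ending test stands in for the tagged query above.

```python
def kwic(tokens, is_keyword, width=3):
    """Return (left context, keyword, right context) for every matching token."""
    lines = []
    for i, tok in enumerate(tokens):
        if is_keyword(tok):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Invented sample text; a real corpus would supply millions of tokens.
TEXT = ("The teacher read two stories to the class and the babies "
        "slept while she studies the replies")
TOKENS = TEXT.lower().split()

# Untagged stand-in for the slide's query: any word ending in -ies.
HITS = kwic(TOKENS, lambda t: t.endswith("ies"))

for left, keyword, right in HITS:
    # Right-align the left context so the keywords form a column.
    print(f"{left:>20}  {keyword:<8}  {right}")
```

Reading down the keyword column is what makes irrelevant hits (wrong sense, wrong part of speech, typos) easy to spot.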

Limit searches by genre

Other questions a corpus can answer

Are there more nouns or verbs ending in -ies? *ies.[V*] vs. *ies.[N*]

Are there four-letter verbs ending in -ed in the present tense? ??ed.[VVB]

What are the most common adjectives describing students vs. pupils? [j*] [student] vs. [j*] [pupil]

What do we say teachers do most often? [teacher] [vvb]
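To make the first question concrete, here is a small Python sketch of a word.[tag] search over a hand-tagged word list. Everything in it is an assumption for illustration: the tiny list, the tags assigned to it (written in the CLAWS style shown on the slides), and the function names. It handles only the single-word `word.[tag]` form, not multi-word queries or lemma brackets.

```python
from fnmatch import fnmatchcase

# Hypothetical hand-tagged word list; a real corpus holds millions of tokens.
# Tags imitate the CLAWS style: NN2 plural noun, VVZ -s form of a verb,
# VVD past tense of a verb.
TAGGED = [
    ("stories", "nn2"), ("babies", "nn2"), ("studies", "vvz"),
    ("studies", "nn2"), ("carries", "vvz"), ("replies", "nn2"),
    ("tried", "vvd"), ("teachers", "nn2"),
]


def parse_query(query):
    """Split 'word.[tag]' into lower-cased (word pattern, tag pattern)."""
    word_pat, sep, tag_part = query.partition(".[")
    if not sep:
        return word_pat.lower(), "*"  # no tag given: accept any tag
    return word_pat.lower(), tag_part.rstrip("]").lower()


def search(query, tagged=TAGGED):
    """Return the words whose form and tag both match the query."""
    word_pat, tag_pat = parse_query(query)
    return [word for word, tag in tagged
            if fnmatchcase(word, word_pat) and fnmatchcase(tag, tag_pat)]


nouns = search("*ies.[N*]")
verbs = search("*ies.[V*]")
print(len(nouns), "nouns:", nouns)
print(len(verbs), "verbs:", verbs)
```

The word studies appears under both tags, which is exactly the kind of homograph the part of speech tags are there to separate.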

Corpus, rules, and regularity

http://www.flickr.com/photos/51505078@N00/352492687

pre*

*ed

*ies.[V*]

Collocations Limits on variability

See also Kennedy, p. 80-23

Collocations (cont) Limits on variability

Collocations (cont)

[teacher] must [v*]
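A query like the one above asks which words most often follow teacher or teachers plus must. The Python sketch below counts them in an invented sentence; the text and function name are assumptions, and unlike the real query it does not check that the following word is a verb.

```python
from collections import Counter

# Invented sample text standing in for corpus data.
TEXT = ("teachers must plan lessons and a teacher must mark work "
        "but pupils must listen and teachers must plan ahead")


def words_after(tokens, first_forms, second):
    """Count the word that follows any of first_forms + second."""
    counts = Counter()
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        if a in first_forms and b == second:
            counts[c] += 1
    return counts


TOKENS = TEXT.split()
# Listing both forms by hand imitates the [teacher] lemma search.
COUNTS = words_after(TOKENS, {"teacher", "teachers"}, "must")

print(COUNTS.most_common())
```

Sorting collocates by frequency is what separates typical combinations from one-off ones, which is why low counts deserve a second look.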

Idioms and set phrases

275 results

359 results

Google as a Corpus

"put the search text in quotes"

use * for the search item

Google as a Corpus

Pros & Cons

PRO: rare, low-frequency usage; up-to-date usage

CON: no sampling, no frequency sort, no genre limit, no part of speech tags

Google results counts are only rough estimates…

http://searchengineland.com/why-google-cant-count-results-properly-53559

Different people searching in different geographic locations can get different numbers

Sometimes searching for A gives fewer results than searching for A without B

…but Google fights can be fun

WebCorp makes Google search results linguist-friendly

Avoid Common Corpus Errors

Be aware of limitations: sampling, coverage, size, presence of typos and errors, bad part of speech tagging

Beware of low frequency results

Beware of homographs

Check results come from multiple sources

Check KWIC to confirm relevance

Limit search by genre

http://www.flickr.com/photos/andreassolberg/433734311

Check examples and sources

Always check low frequency results

must [v*] [n*]

…sometimes they come from the same source

False roots

http://etymonline.com

corner, silly, preface, cockroach, protest, stable …

Make your own corpus with TextSTAT

http://neon.niederlandistik.fu-berlin.de/en/textstat

Make your own corpus with AntConc

http://www.antlab.sci.waseda.ac.jp/software.html

Corpus in the classroom

teacher preparation

student discovery

Teacher preparation

find relevant, common examples

prepare worksheets

check for exceptions

find out answers to student questions about rules and usage

Student discovery

show search results to students to work out rules or word meanings

teach students how to search for questions

ask students to give each other puzzles for searching

For heavy classroom use…

register for group access to prevent spam lockout

Corpus v dictionary

Non-classroom corpus use

supplement dictionary

crossword puzzles

check typical usage when writing

Where to go next?

http://www.corpora4learning.net

Thank you

Contact: http://dominiklukes.net