1 Linguistic evidence within and across languages, word frequency lists and language learning Adam...

Post on 26-Dec-2015



Linguistic evidence within and across languages, word frequency lists and language learning

Adam Kilgarriff
Lexical Computing Ltd
Lexicography MasterClass
Universities of Leeds and Sussex

Linguistic evidence within and across languages, word frequency lists and language learning

Or: Word lists are useful, but are they (could they be) scientific?

Leeds April 2010 Kilgarriff: KELLY 3

KELLY

• EU lifelong learning project
• Goal: wordcards
– Word in one lg on one side, its translation on the other
– For language learning
• 9 languages, 36 pairs
– Arabic, Chinese, English, Greek, Italian, Norwegian, Polish, Russian, Swedish
• Partners (incl Leeds) in 6 countries
– Leeds does Arabic, Chinese, Russian


Method

• Prepare monolingual lists
• Translate
– Each into 8 target languages
– Professional translation services
• Integrate, finalise
• Produce cards
• Goal for each set
– 9000 pairs at 6 levels


(Monolingual) Word Lists

• Define a syllabus
• Which words get used in
– Learning-to-read books (NS children)
– NNS language learner textbooks
– Dictionaries
– Language testing
• NS: educational psychologists
• NNS: proficiency levels


Should be corpus-based

• Most aren't
– Corpora are quite new
• Easy to do better
• People will use them
– Maybe also governments


How

• Take your corpus
• Count
• Voila
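The "take your corpus, count" recipe can be sketched in a few lines of Python. This is only an illustration: the tokenisation rule (runs of letters, with internal apostrophes and hyphens kept) and the function name `word_frequencies` are assumptions, not part of the KELLY method, and the rule already runs into the complications discussed on the following slides.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count lower-cased word forms in a text (naive tokenisation)."""
    # Assumption: a word is a run of letters, optionally joined by
    # internal apostrophes or hyphens (farmer's, can't, co-operate).
    tokens = re.findall(r"[^\W\d_]+(?:['-][^\W\d_]+)*", text.lower())
    return Counter(tokens)

sample = "The farmer's dog can't co-operate. The dog sat."
freqs = word_frequencies(sample)

# Most frequent word forms first.
for word, count in freqs.most_common(3):
    print(word, count)
```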


Complications

• What is a word
• Words and lemmas
• Grammatical classes
• Numbers, names...
• Multiwords
• Homonymy

All are slightly different issues for each lg


What is a word; delimiters

• Found between spaces
– Not for Chinese: segmentation
• English
– co-operate, widely-held, farmer's, can't
• Norwegian, Swedish
– Compounding, separable verbs
• Arabic, Italian
– Clitics, al, ...
• ...


Words and lemmas

• Word form (in text)
– invading
• Lemma (dictionary headword)
– invade, for forms invade, invades, invaded, invading
• Lemmatisation
– Chinese: none; English: simple
– Middling: Swe, Nor, It, Gr
– Tough: Rus, Pol, Ara


Grammatical classes

• brush (verb) and brush (noun)
– Same item or different?
• Proposal: lempos
– Recommendation: different
– With trepidation
• Chinese: weak sense of noun, verb
• Required (short) list of word classes for each lg
– Same for all unless good reason
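Counting by lempos rather than by word form can be sketched as follows. The input triples stand in for the output of a lemmatiser and tagger, which is not shown; the `lemma-pos` key format with one-letter class codes and the name `lempos_frequencies` are assumptions made for this illustration.

```python
from collections import Counter

# Hypothetical tagger output: (word form, lemma, word class) triples.
tagged = [
    ("brushes", "brush", "n"),
    ("brushed", "brush", "v"),
    ("brush", "brush", "n"),
    ("invading", "invade", "v"),
    ("invaded", "invade", "v"),
]

def lempos_frequencies(tokens):
    """Count lemma + word-class pairs, so brush (noun) and brush (verb) stay apart."""
    return Counter(f"{lemma}-{pos}" for _form, lemma, pos in tokens)

counts = lempos_frequencies(tagged)
for item, count in counts.most_common():
    print(item, count)
```

With this keying, the forms invading and invaded fall under one entry, while the noun and verb readings of brush are listed separately, as the recommendation above proposes.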


Marginal cases

• Numbers
– twelve, seventeenth, fifties
• Closed sets
– Days of week, months
• Countries
– Capitals, nationalities, currencies, adjectives, languages
• regional/dialects, political groups, religions
– easter, christmas, islam, republican
• Consistency before freq: policies needed


Multiwords

• According to
– Linguistically a word but
• Multiword frequency list: top item "of the"
– Can't use freqs (alone) to select multiwords
• Base list
– Recommendation: no multiwords
– But see below


Homonymy

• bank (river) and bank (money)
• Word sense disambiguation
– We can't do (with decent accuracy)
– We can't give freqs for senses
• Lists of words not meanings
– Sometimes disconcerting
– See also below


Corpora

• A fairly arbitrary sample of a lg
• To limit arbitrariness of wdlist
– Make it big and diverse
• WACKY corpora
– From web
– Can do for any language
– Web language: less formal
– not mainly 'reporting' or fiction, cf news, BNC
– Good for lg learners


Comparing corpora

• Corpora: new
– We are all beginners
• Best way to get sense of a corpus
– Compare with another
– Keywords of each vs. other
• Case studies
• Sketch Engine functions


Comparing frequency lists

• Web1T

– Present from Google

– All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English

• that's 1,000,000,000,000

• Compare with BNC

– Take top 50,000 items of each

– 105 Web1T words not in BNC top50k

– 50 words with highest Web1T:BNC ratio

– 50 words with lowest ratio
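The comparison steps above can be sketched as a small function. The slides do not say how the ratio was computed, so normalising each frequency by the total of its list, and restricting the ratio to items present in both top-N lists, are assumptions; the function name and the toy counts are hypothetical.

```python
def compare_frequency_lists(freq_a, freq_b, top_n=50_000, k=50):
    """Return (items only in A's top N, k highest A:B ratios, k lowest A:B ratios)."""
    total_a = sum(freq_a.values())
    total_b = sum(freq_b.values())
    # Top N items of each list by raw frequency.
    top_a = set(sorted(freq_a, key=freq_a.get, reverse=True)[:top_n])
    top_b = set(sorted(freq_b, key=freq_b.get, reverse=True)[:top_n])
    only_in_a = sorted(top_a - top_b)
    # Assumption: ratio of relative frequencies, for items in both top lists.
    ratios = {
        w: (freq_a[w] / total_a) / (freq_b[w] / total_b)
        for w in top_a & top_b
    }
    ranked = sorted(ratios, key=ratios.get, reverse=True)
    return only_in_a, ranked[:k], ranked[::-1][:k]

# Toy counts standing in for the Web1T and BNC lists.
web = {"the": 1000, "of": 500, "www": 300, "she": 100, "browser": 50}
bnc = {"the": 1000, "of": 600, "she": 400, "council": 50, "www": 1}

only_web, web_high, web_low = compare_frequency_lists(web, bnc, top_n=4, k=1)
print(only_web, web_high, web_low)
```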


Web-high (155 terms)

• 61 web and computing
– config browser spyware url www forum
• 38 porn
• 22 US English (incl Spanish influence – los)
• 18 business/products common on web
– poker viagra lingerie ringtone dvd casino rental collectible tiffany
– NB: BNC is old
• 4 legal
– trademarks pursuant accordance herein


Web-low

• Exclude British English, transcription/tokenisation anomalies

– herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him


Observations

• Pronouns and past tense verbs

– Fiction

• Masc vs fem

• Yesterday

– Probably daily newspapers

• Constancy of ratios:

– He/him/himself

– She/her/herself


Corpus Factory

• Many languages
• General corpus, 100m+ words
• Fast
• High quality
• Comparable across languages


Gather Seed words

• Wikipedia (Wiki) corpora
– many domains
– free
– 265 languages covered, more to come
• Extract text from Wiki
– Wikipedia 2 Text
• Tokenise the text
– Morphology of the language is important
– Can use the existing word tokeniser tools


Web Corpus Statistics

Language     Unique URLs   After       After            Web corpus size
             collected     filtering   de-duplication   MB        Words
Dutch        97,584        22,424      19,708           739 MB    108.6 m
Hindi        71,613        20,051      13,321           424 MB    30.6 m
Telugu       37,864        6,178       5,131            107 MB    3.4 m
Thai         120,314       23,320      20,998           1.2 GB    81.8 m
Vietnamese   106,076       27,728      19,646           1.2 GB    149 m


Evaluation

• For each of the languages, two corpora available: Web and Wiki
– Dutch: also a carefully designed lexicographic corpus
• Hypothesis: Wiki corpora are 'informational'
– Informational --> typical written
– Interactional --> typical spoken


Evaluation

• 1st, 2nd person pronouns are strong indicators of interactional language
– English: I me my mine you your yours we us our
• For each language
– Ratio: web:wiki


Results

Thai

Word       Web     Wiki    Ratio
ผม         2935    366     8.00
ดิฉัน       133     19      7.00
ฉัน         770     97      7.87
คุณ         1722    320     5.36
ท่าน        2390    855     2.79
กระผม      21      6       3.20
ข้าพเจ้า     434     66      6.54
ตัว         2108    2070    1.01
กู          179     148     1.20
ชั้น         431     677     0.63
Total      11123   4624    2.40

Table: 1st and 2nd person pronouns in Web and Wiki corpora per million words
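A web:wiki ratio of this kind can be sketched as below: raw counts are first normalised to occurrences per million words, so that corpora of different sizes are comparable, and the two normalised figures are then divided. The function names and the example counts are hypothetical and are not taken from the table.

```python
def per_million(count, corpus_size):
    """Normalise a raw count to occurrences per million words."""
    return count * 1_000_000 / corpus_size

def web_wiki_ratio(web_count, web_size, wiki_count, wiki_size):
    """Ratio of per-million frequencies, web corpus over Wiki corpus."""
    return per_million(web_count, web_size) / per_million(wiki_count, wiki_size)

# Hypothetical pronoun: 800 hits in a 100m-word web corpus,
# 50 hits in a 25m-word Wiki corpus.
web_pm = per_million(800, 100_000_000)
wiki_pm = per_million(50, 25_000_000)
ratio = web_wiki_ratio(800, 100_000_000, 50, 25_000_000)
print(web_pm, wiki_pm, ratio)
```

A ratio well above 1 would suggest the pronoun is more typical of the web corpus, in line with the hypothesis that Wiki text is more 'informational'.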


ANW
Theme        Word        English gloss
Belgian      Brussel     (city)
             Belgische   Belgian
             Vlaamse     Flemish
Fiction      keek        looked/watched
Newspapers   vorig       previous
             kreek       watched/looked
             procent     percent
             miljoen     million
             miljard     billion
             frank       (Belgian) franc

NlWaC
Theme        Word        English gloss
Religion     God
             Jezus
             Christus
             Gods
Web          http
             Geplaatst   posted
             Nl          (Web domain)
             Bewerk      edited
             Reacties    replies
             www


Stages

• Sort out corpora, tagging
• Automatically generate M1 lists
– names, numbers, countries ...
– keywords vis-a-vis other corpora
• Review, prepare M2 lists
• Translate


review - how?

• points system
– 2 points for each of 6 levels
– 12 points for most freq words
• deduct points for words in over-represented areas
• add in words from other corpora
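One possible reading of the points system is sketched below. The slide gives only the outline, so several details are assumptions: that the frequency-ranked list is cut into six equal bands scoring 12, 10, 8, 6, 4 and 2 points, that the deduction is a fixed penalty, and that words added from other corpora enter at the lowest score. All names and parameter values are hypothetical.

```python
def assign_points(ranked_words, band_size, overrepresented=(),
                  penalty=2, extra_words=(), extra_points=2):
    """Score words by frequency band, then adjust (illustrative only)."""
    points = {}
    # Six bands: the most frequent band gets 12 points, the sixth gets 2.
    for rank, word in enumerate(ranked_words[:6 * band_size]):
        band = rank // band_size
        points[word] = 12 - 2 * band
    # Deduct points for words from over-represented areas (e.g. web terms).
    for word in overrepresented:
        if word in points:
            points[word] = max(points[word] - penalty, 0)
    # Add words suggested by other corpora, without overwriting scores.
    for word in extra_words:
        points.setdefault(word, extra_points)
    return points

ranked = ["the", "of", "url", "she", "dog", "cat", "sat"]
scores = assign_points(ranked, band_size=1,
                       overrepresented={"url"},
                       extra_words=["hello", "the"])
print(scores)
```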


Translation database

• On the web
• All translations entered into it
• Queries like
– All Swedish words used as translations more than six times
– All 1:1:1:1... 'simple cases'
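Queries of this kind can be sketched over an in-memory list of translation records. The record layout, the function names and the toy data are assumptions for illustration; the actual KELLY database schema is not described in the slides. The second function is a simplification of the 1:1:1:1... case: it only checks that a source word has a single translation in each language, not that the mapping is unique in the reverse direction.

```python
from collections import Counter, defaultdict

# Hypothetical records: (source lg, source word, target lg, target word).
translations = [
    ("en", "big", "sv", "stor"),
    ("en", "large", "sv", "stor"),
    ("en", "great", "sv", "stor"),
    ("it", "grande", "sv", "stor"),
    ("en", "dog", "sv", "hund"),
    ("en", "dog", "it", "cane"),
    ("en", "big", "it", "grande"),
    ("en", "big", "it", "grosso"),
]

def frequent_targets(records, target_lang, min_uses):
    """Target-language words used as a translation more than min_uses times."""
    uses = Counter(tgt for _sl, _sw, tl, tgt in records if tl == target_lang)
    return sorted(word for word, n in uses.items() if n > min_uses)

def single_translation_words(records, source_lang):
    """Source words with exactly one translation in every language covered."""
    options = defaultdict(set)
    for sl, sw, tl, tgt in records:
        if sl == source_lang:
            options[(sw, tl)].add(tgt)
    ambiguous = {sw for (sw, _tl), tgts in options.items() if len(tgts) > 1}
    words = {sw for (sw, _tl) in options}
    return sorted(words - ambiguous)

# The slide's query uses a threshold of six; the toy data uses three.
busy_swedish = frequent_targets(translations, "sv", 3)
simple_english = single_translation_words(translations, "en")
print(busy_swedish, simple_english)
```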


Translations

• Usually, of texts
– Words in context
• Kelly: no context
– Usual principles don't apply
– Instructions to translators


Using the database

• Find words not in M2 lists, that need adding
• Multiwords
– English look for
– Probably, the translation of a high-freq word in several of the 8 other lgs
– So: add it to English list
• Homonyms: could be similar


Monolingual master lists (M3)

• Based on a WAC corpus
• Input from other same-lg corpora
• And from translations from 8 lgs
– Useful words which might not be hi-freq
– added words/multiwords must be above a lower freq threshold
• Target 9000
• Important contribution


Numbers

• Target: 9000 per list
• M2 lists
– Estimate: 5000-6000 needed
• We add 3000-4000 multiwords and other 'back-translations'


From M3 lists to T2 lists


Current status

• M1 lists prepared
• Lists checked, compared with other lists
– Corpus-based and other
• M2 lists prepared
• Translation underway


Big problems

• Multiwords (as anticipated)
• Homonymy (as anticipated)
• orange banana alphabet elbow, Hello
– Worse than anticipated
– Lists from spoken corpora, learner corpora, needed
• Relation between
– Competence for communicating
– The corpora at our disposal


Word lists are useful, but

• ...are they scientific?
– A tiny bit, occasionally
• ...could they be scientific?
– Yes
– article of faith
• By the end of KELLY, we'll have a clearer idea how


http://forbetterenglish.com