Post on 26-Dec-2015
Linguistic evidence within and across languages, word frequency lists and language learning
Adam Kilgarriff
Lexical Computing Ltd
Lexicography MasterClass
Universities of Leeds and Sussex
Or: Word lists are useful, but are they (could they be) scientific?
Leeds April 2010 Kilgarriff: KELLY 3
KELLY
• EU lifelong learning project
• Goal: wordcards
– Word in one language on one side, its translation in the other language on the other side
– For language learning
• 9 languages, 36 pairs
– Arabic, Chinese, English, Greek, Italian, Norwegian, Polish, Russian, Swedish
• Partners (incl. Leeds) in 6 countries
– Leeds does Arabic, Chinese, Russian
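The figure of 36 pairs follows from choosing 2 of the 9 languages without regard to direction. A minimal sketch of that arithmetic (variable names are illustrative, not taken from the project):

```python
from itertools import combinations

# The nine KELLY languages as listed on the slide.
languages = [
    "Arabic", "Chinese", "English", "Greek", "Italian",
    "Norwegian", "Polish", "Russian", "Swedish",
]

# Unordered pairs: each pair of languages is counted once.
pairs = list(combinations(languages, 2))
print(len(pairs))  # 36

# Directed translation jobs: each list goes into the 8 other languages.
directed = [(src, tgt) for src in languages for tgt in languages if src != tgt]
print(len(directed))  # 72
```

Each unordered pair corresponds to two directed translation jobs, which is why every monolingual list is translated into 8 target languages.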
Method
• Prepare monolingual lists
• Translate
– Each into 8 target languages
– Professional translation services
• Integrate, finalise
• Produce cards
• Goal for each set
– 9000 pairs at 6 levels
(Monolingual) Word Lists
• Define a syllabus
• Which words get used in
– Learning-to-read books (native-speaker, NS, children)
– Non-native-speaker (NNS) language learner textbooks
– Dictionaries
– Language testing
• NS: educational psychologists
• NNS: proficiency levels
Should be corpus-based
• Most aren't
• Corpora are quite new
• Easy to do better
• People will use them
– Maybe also Governments
How
• Take your corpus
• Count
• Voilà
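The three steps can be sketched in a few lines. This is a deliberately naive version, assuming that lower-casing and splitting on whitespace is an acceptable definition of a word; the function name and sample text are made up for illustration.

```python
from collections import Counter

def frequency_list(text):
    """Return (word, count) pairs, most frequent first.

    Naive on purpose: lower-cases and splits on whitespace only.
    """
    tokens = text.lower().split()
    return Counter(tokens).most_common()

sample = "the cat sat on the mat and the dog sat too"
freqs = frequency_list(sample)
print(freqs[:2])  # [('the', 3), ('sat', 2)]
```

Every simplification made here (what counts as a token, no lemmas, no word classes) is a decision that a real word list has to take explicitly.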
Complications
• What is a word
• Words and lemmas
• Grammatical classes
• Numbers, names...
• Multiwords
• Homonymy
• All are slightly different issues for each lg
What is a word; delimiters
• Found between spaces
• Not for Chinese: segmentation
• English
– co-operate, widely-held, farmer's, can't
• Norwegian, Swedish
– Compounding, separable verbs
• Arabic, Italian
– Clitics, al, ...
• ...
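The English examples show how much the answer depends on the delimiter policy. The sketch below contrasts two simple policies on a made-up sentence; neither is claimed to be the policy KELLY adopted.

```python
import re

text = "Farmers co-operate; the farmer's widely-held view can't change."

# Policy A: a word is whatever sits between spaces (edge punctuation stripped).
by_space = [t.strip(".,;:!?") for t in text.split()]

# Policy B: a word is a run of letters; hyphens and apostrophes split it.
by_letters = re.findall(r"[A-Za-z]+", text)

print(len(by_space), by_space)
print(len(by_letters), by_letters)
```

Policy A keeps `co-operate` and `can't` whole and yields 8 tokens; policy B yields 12, including fragments such as `s` and `t`, so the two policies produce different frequency lists from the same text.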
Words and lemmas
• Word form (in text)
– invading
• Lemma (dictionary headword)
– invade, for forms invade, invades, invaded, invading
• Lemmatisation
– Chinese: none; English: simple
– Middling: Swedish, Norwegian, Italian, Greek
– Tough: Russian, Polish, Arabic
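Counting lemmas rather than word forms means folding each form into its headword before counting. A minimal sketch, assuming a hand-made form-to-lemma table (a real system would use a lemmatiser for the language in question):

```python
from collections import Counter

# Toy form-to-lemma table, for illustration only.
LEMMA_OF = {
    "invade": "invade", "invades": "invade",
    "invaded": "invade", "invading": "invade",
}

def lemma_counts(tokens):
    """Count lemmas, falling back to the word form when no lemma is known."""
    return Counter(LEMMA_OF.get(tok, tok) for tok in tokens)

tokens = "they invaded and are invading while he invades".split()
counts = lemma_counts(tokens)
print(counts["invade"])  # 3
```

The fallback to the raw form matters in practice: for the "tough" languages, unknown forms are frequent and end up as separate entries in the list.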
Grammatical classes
• brush (verb) and brush (noun)
– Same item or different?
– Proposal: lempos (lemma + part of speech)
• Recommendation: different
– With trepidation
– Chinese: weak sense of noun, verb
• Required: (short) list of word classes for each lg
– Same for all unless good reason
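Treating brush (verb) and brush (noun) as different items amounts to counting lemma and word class together. A small sketch with hypothetical tagged tokens; the `lemma-pos` string format is an illustrative convention, not a specification of the project's data.

```python
from collections import Counter

# Hypothetical tagged tokens as (lemma, word class) pairs.
tagged = [
    ("brush", "v"), ("brush", "n"), ("brush", "n"),
    ("tooth", "n"), ("brush", "v"), ("brush", "n"),
]

by_lemma = Counter(lemma for lemma, _ in tagged)
by_lempos = Counter(f"{lemma}-{pos}" for lemma, pos in tagged)

print(by_lemma["brush"])     # 5
print(by_lempos["brush-n"])  # 3
print(by_lempos["brush-v"])  # 2
```

Splitting by word class lowers each item's frequency, so a word that clears a frequency threshold as a lemma may fall below it once its noun and verb uses are counted separately.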
Marginal cases
• Numbers
– twelve, seventeenth, fifties
• Closed sets
– Days of week, months
• Countries
– Capitals, nationalities, currencies, adjectives, languages
• Regional/dialects, political groups, religions
– easter, christmas, islam, republican
• Consistency before freq: policies needed
Multiwords
• "according to"
– Linguistically a word, but
– Multiword frequency list: top item is "of the"
– Can't use freqs (alone) to select multiwords
• Base list
– Recommendation: no multiwords
– But see below
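The point about frequency alone can be seen even on a toy text: counting adjacent word pairs puts a sequence of function words at the top, above the genuine multiword. The text below is invented for the purpose.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs."""
    return Counter(zip(tokens, tokens[1:]))

text = (
    "according to the head of the school the rest of the staff "
    "left at the end of the day according to the plan"
)
tokens = text.split()
counts = bigram_counts(tokens)
top = counts.most_common(1)[0]
print(top)                           # (('of', 'the'), 3)
print(counts[("according", "to")])   # 2
```

Some further criterion, such as an association measure or a human judgement, is needed to separate "according to" from "of the"; which criterion to use is left open here.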
Homonymy
• bank (river) and bank (money)
• Word sense disambiguation
– We can't do it (with decent accuracy)
– We can't give freqs for senses
• Lists of words, not meanings
– Sometimes disconcerting
– See also below
Corpora
• A fairly arbitrary sample of a lg
• To limit arbitrariness of wordlist
– Make it big and diverse
• WACKY corpora
– From web
– Can do for any language
• Web language: less formal
– Not mainly 'reporting' or fiction, cf. news, BNC
– Good for lg learners
Comparing corpora
• Corpora: new
– We are all beginners
• Best way to get sense of a corpus
– Compare with another
– Keywords of each vs. other
• Case studies
• Sketch Engine functions
Comparing frequency lists
• Web1T
– Present from Google
– All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English
• that’s 1,000,000,000,000
• Compare with BNC
– Take top 50,000 items of each
– 105 Web1T words not in BNC top50k
– 50 words with highest Web1T:BNC ratio
– 50 words with lowest ratio
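The comparison can be sketched as follows, using toy counts in place of Web1T and the BNC. Two assumptions are made that the slide does not state: ratios are computed on relative frequencies, and only for words that appear in both top lists.

```python
def top_items(freqs, n):
    """Return the n most frequent words of a {word: count} mapping."""
    return sorted(freqs, key=freqs.get, reverse=True)[:n]

def compare_lists(freqs_a, freqs_b, n):
    """Compare the top-n lists of corpus A and corpus B.

    Returns the words in A's top n but not in B's, and the words in both
    top lists ranked by the ratio of relative frequencies (A-high first).
    """
    total_a, total_b = sum(freqs_a.values()), sum(freqs_b.values())
    top_a, top_b = top_items(freqs_a, n), set(top_items(freqs_b, n))
    only_a = [w for w in top_a if w not in top_b]
    shared = [w for w in top_a if w in top_b]
    ratio = {w: (freqs_a[w] / total_a) / (freqs_b[w] / total_b) for w in shared}
    return only_a, sorted(shared, key=ratio.get, reverse=True)

# Toy counts standing in for the web list and the BNC list.
web = {"the": 100, "browser": 30, "she": 5, "url": 20, "had": 10}
bnc = {"the": 100, "she": 40, "had": 30, "council": 10, "browser": 1}

only_web, ranked = compare_lists(web, bnc, n=5)
print(only_web)  # ['url']
print(ranked)    # ['browser', 'the', 'had', 'she']
```

The head of `ranked` plays the role of the highest-ratio words and the tail that of the lowest-ratio words; with real data, `n` would be 50,000 and 50 items would be taken from each end.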
Web-high (155 terms)
• 61 web and computing
– config browser spyware url www forum
• 38 porn
• 22 US English (incl Spanish influence – los)
• 18 business/products common on web
– poker viagra lingerie ringtone dvd casino rental collectible tiffany
– NB: BNC is old
• 4 legal
– trademarks pursuant accordance herein
Web-low
• Exclude British English, transcription/tokenisation anomalies
– herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him
Observations
• Pronouns and past tense verbs
– Fiction
• Masc vs fem
• Yesterday
– Probably daily newspapers
• Constancy of ratios:
– He/him/himself
– She/her/herself
Corpus Factory
• Many languages
• General corpus, 100m+ words
• Fast
• High quality
• Comparable across languages
Gather Seed words
• Wikipedia (Wiki) corpora
– Many domains
– Free
– 265 languages covered, more to come
• Extract text from Wiki
– Wikipedia 2 Text
• Tokenise the text
– Morphology of the language is important
– Can use existing word tokeniser tools
Web Corpus Statistics
Language     Unique URLs collected   After filtering   After de-duplication   Web corpus size   Words
Dutch        97,584                  22,424            19,708                 739 MB            108.6 m
Hindi        71,613                  20,051            13,321                 424 MB            30.6 m
Telugu       37,864                  6,178             5,131                  107 MB            3.4 m
Thai         120,314                 23,320            20,998                 1.2 GB            81.8 m
Vietnamese   106,076                 27,728            19,646                 1.2 GB            149 m
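Read as a pipeline, the table shows how few of the collected URLs survive to the final corpus. The sketch below computes that share from the figures above; the variable names are illustrative.

```python
# (language, URLs collected, after filtering, after de-duplication),
# copied from the Web Corpus Statistics table.
rows = [
    ("Dutch", 97_584, 22_424, 19_708),
    ("Hindi", 71_613, 20_051, 13_321),
    ("Telugu", 37_864, 6_178, 5_131),
    ("Thai", 120_314, 23_320, 20_998),
    ("Vietnamese", 106_076, 27_728, 19_646),
]

# Share of collected URLs that survive both filtering and de-duplication.
retention = {lang: deduped / collected for lang, collected, _, deduped in rows}

for lang, share in retention.items():
    print(f"{lang}: {share:.1%} of collected URLs kept")
```

For every language in the table, roughly one collected URL in five, or fewer, ends up in the corpus.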
Evaluation
• For each of the languages, two corpora available: Web and Wiki
– Dutch: also a carefully designed lexicographic corpus
• Hypothesis: Wiki corpora are ‘informational’
– Informational --> typical written
– Interactional --> typical spoken
Evaluation
• 1st, 2nd person pronouns: strong indicators of interactional language
– English: I me my mine you your yours we us our
• For each language
– Ratio: web:wiki
Results
Thai
Word        Web     Wiki    Ratio
ผม          2935    366     8.00
ดิฉัน         133     19      7.00
ฉัน          770     97      7.87
คุณ          1722    320     5.36
ท่าน         2390    855     2.79
กระผม       21      6       3.20
ข้าพเจ้า       434     66      6.54
ตัว          2108    2070    1.01
กู           179     148     1.20
ชั้น          431     677     0.63
Total       11123   4624    2.40
Table: 1st and 2nd person pronouns in Web and Wiki corpora, per million words
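The Total row can be reproduced by summing the per-million figures for the ten pronouns and dividing web by wiki. A minimal sketch using the numbers from the table (the function name is illustrative):

```python
def web_wiki_ratio(web_pm, wiki_pm):
    """Ratio of summed per-million frequencies, web over wiki."""
    return sum(web_pm) / sum(wiki_pm)

# Per-million figures for the ten Thai pronouns, in table order.
web = [2935, 133, 770, 1722, 2390, 21, 434, 2108, 179, 431]
wiki = [366, 19, 97, 320, 855, 6, 66, 2070, 148, 677]

print(sum(web), sum(wiki))                  # 11123 4624
print(round(web_wiki_ratio(web, wiki), 1))  # 2.4
```

A ratio well above 1 means these pronouns are markedly more frequent in the web corpus than in the Wiki corpus, which is the direction the hypothesis predicts.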
ANW
Theme        Word        English gloss
Belgian      Brussel     (city)
             Belgische   Belgian
             Vlaamse     Flemish
Fiction      keek        looked/watched
Newspapers   vorig       previous
             kreek       watched/looked
             procent     percent
             miljoen     million
             miljard     billion
             frank       (Belgian) franc
NlWaC
Theme        Word        English gloss
Religion     God
             Jezus
             Christus
             Gods
Web          http
             Geplaatst   posted
             Nl          (Web domain)
             Bewerk      edited
             Reacties    replies
             www
Stages
• Sort out corpora, tagging
• Automatically generate M1 lists
– names, numbers, countries ...
– keywords vis-a-vis other corpora
• Review, prepare M2 lists
• Translate
Review - how?
• Points system
– 2 points for each of 6 levels
– 12 points for most freq words
• Deduct points for words in over-represented areas
• Add in words from other corpora
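One way to read the points system is as a score that falls by 2 per frequency band, from 12 for the top band to 2 for the sixth. The sketch below rests on assumptions the slide does not state: equal-sized bands of 1500 ranks (9000 divided by 6) and a fixed 2-point penalty. All names are hypothetical.

```python
LEVELS = 6
POINTS_PER_LEVEL = 2

def base_points(rank, band_size):
    """Points from frequency rank (1 = most frequent).

    The top band scores 12, the next 10, and so on down to 2;
    ranks beyond the sixth band score 0.
    """
    band = (rank - 1) // band_size  # 0 for the top band
    if band >= LEVELS:
        return 0
    return (LEVELS - band) * POINTS_PER_LEVEL

def review_score(rank, band_size, over_represented=False, penalty=2):
    """Deduct an (assumed) penalty for words from over-represented areas."""
    score = base_points(rank, band_size)
    return max(score - penalty, 0) if over_represented else score

print(base_points(1, 1500))     # 12
print(base_points(1501, 1500))  # 10
print(base_points(9000, 1500))  # 2
print(review_score(10, 1500, over_represented=True))  # 10
```

With a 2-point penalty, a word from an over-represented area effectively drops one level; a larger penalty would push such words further down the list.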
Translation database
• On the web
• All translations entered into it
• Queries like
– All Swedish words used as translations more than six times
– All 1:1:1:1... 'simple cases'
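Both example queries can be sketched over a flat list of translation records. The record layout, the sample data and the function names are hypothetical; the actual database schema is not described in the slides. The one-to-one check below covers a single language pair, a simplification of the 1:1:1:1... case across all languages.

```python
from collections import Counter

# Hypothetical records: (source language, source word, target language, target word).
translations = [
    ("en", "big", "sv", "stor"), ("en", "large", "sv", "stor"),
    ("en", "great", "sv", "stor"), ("en", "dog", "sv", "hund"),
    ("it", "cane", "sv", "hund"), ("it", "grande", "sv", "stor"),
]

def used_more_than(records, target_lang, n):
    """Target-language words used as a translation more than n times."""
    counts = Counter(tgt for _, _, lang, tgt in records if lang == target_lang)
    return sorted(w for w, c in counts.items() if c > n)

def one_to_one(records, src_lang, tgt_lang):
    """Pairs where the source word has a single translation and vice versa."""
    pairs = {(s, t) for sl, s, tl, t in records if sl == src_lang and tl == tgt_lang}
    src_n = Counter(s for s, _ in pairs)
    tgt_n = Counter(t for _, t in pairs)
    return sorted(p for p in pairs if src_n[p[0]] == 1 and tgt_n[p[1]] == 1)

print(used_more_than(translations, "sv", 3))  # ['stor']
print(one_to_one(translations, "en", "sv"))   # [('dog', 'hund')]
```

Words that attract many translations, like `stor` here, are candidates for closer review, while the one-to-one pairs are the cases least likely to need it.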
Translations
• Usually, of texts
– Words in context
• Kelly: no context
– Usual principles don't apply
– Instructions to translators
Using the database
• Find words not in M2 lists that need adding
• Multiwords
– English "look for"
– Probably the translation of a high-freq word in several of the 8 other lgs
– So: add it to English list
• Homonyms: could be similar
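The "look for" case suggests a simple rule: an item that is missing from a language's list, but is offered as a translation from several other languages, is a candidate for adding. A sketch under assumed names, with invented records and an arbitrary threshold:

```python
from collections import defaultdict

def back_translation_candidates(records, target_lang, target_list, min_langs):
    """Translations into target_lang that are missing from its list and
    come from at least min_langs different source languages."""
    sources = defaultdict(set)
    for src_lang, _, tgt_lang, tgt_word in records:
        if tgt_lang == target_lang and tgt_word not in target_list:
            sources[tgt_word].add(src_lang)
    return sorted(w for w, langs in sources.items() if len(langs) >= min_langs)

# Hypothetical records: (source language, source word, target language, target word).
records = [
    ("sv", "söka", "en", "look for"), ("it", "cercare", "en", "look for"),
    ("pl", "szukać", "en", "look for"), ("sv", "hund", "en", "dog"),
    ("it", "gatto", "en", "cat"),
]
english_m2 = {"dog", "look", "for"}

print(back_translation_candidates(records, "en", english_m2, min_langs=3))
# ['look for']
```

Here `cat` is also missing from the toy list but comes from only one source language, so it falls below the threshold; how high to set `min_langs` is a judgement call.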
Monolingual master lists (M3)
• Based on a WAC corpus
• Input from other same-lg corpora
• And from translations from 8 lgs
– Useful words which might not be hi-freq
– Added words/multiwords must be above a lower freq threshold
• Target 9000
• Important contribution
Numbers
• Target: 9000 per list
• M2 lists
– Estimate: 5000-6000 needed
• We add 3000-4000 multiwords and other 'back-translations'
From M3 lists to T2 lists
Current status
• M1 lists prepared
• Lists checked, compared with other lists
– Corpus-based and other
• M2 lists prepared
• Translation underway
Big problems
• Multiwords (as anticipated)
• Homonymy (as anticipated)
• orange, banana, alphabet, elbow, hello
– Worse than anticipated
– Lists from spoken corpora, learner corpora, needed
• Relation between
– Competence for communicating
– The corpora at our disposal
Word lists are useful, but
• ...are they scientific?
– A tiny bit, occasionally
• ...could they be scientific?
– Yes
– Article of faith
• By the end of KELLY, we'll have a clearer idea how
http://forbetterenglish.com