Free Swedish Word Lists or Hackers’ BLARK Viggo Kann KTH, Stockholm GSLT meeting January 26, 2008.

Free Swedish Word Lists or Hackers’ BLARK

Viggo Kann KTH, Stockholm

GSLT meeting January 26, 2008

What is a free language resource?

Anyone can use it in an application

Anyone can study it and modify it

Anyone can take a copy of it

Anyone can improve it, release the improvements to the public, so that the whole community benefits

(based on the four freedoms of free software, Richard Stallman)

Strong free software culture

GNU project

FSF – Free Software Foundation

GPL – GNU General Public License

OSI – Open Source Initiative

Linux, TeX, Emacs, GCC, MySQL, PHP, Java, Python, Firefox

First meeting of the Free Swedish Words group at KTH January 16

11 persons from around Sweden

Lars Aronsson: Project Runeberg and Swedish Wikipedia (Wiktionary)

Lars Törnquist and Sven Lange: Swedish thesaurus built on Bring (1930)

Christian Mattson: Lexin dictionaries

Niklas Johansson: Spelling error detection and correction in OpenOffice

Göran Andersson: DSSO – The large Swedish word list

Viggo Kann: Stava, Granskatagger, Synlex, Tvärslå Nordic dictionary

Per Starrbäck, Leif-Jöran Olsson, Tomas Padron-McCarthy, Erik Geijer

Plans for more free words

Swedish synonyms in OpenOffice (Niklas)

Extending DSSO with synonyms, associations etc. (Göran)

Building a free Swedish-English dictionary (Viggo)

Testing Swedish grammar checking in LanguageTool/OpenOffice (Viggo & Niklas)

Typical ways to construct a resource

…if you are a language technologist:

Get funding

Use resources that are free to use for researchers

Hire linguists to do the heavy jobs

…if you are a free software hacker:

Use other free resources

Collect data from lots of people using e.g. a wiki or a web form

Example: Synlex

Construct a Swedish dictionary of synonyms as a list of synonymous pairs

I don’t want to work a lot

I don’t want to pay anyone to work

The resulting list should become free

Ideas

Automatically construct a large set of word pairs that might be synonyms

Use tens of thousands of people, each willing to make a small contribution without payment, to check the word pairs

More ideas

Use the Lexin on-line Swedish-English dictionary web site, which had 9 million lookups each month (now 25 million)

Users visit Lexin to translate words, and are thus probably motivated to help me

Each time a user makes a lookup, give her the opportunity to decide whether two words are synonyms or not

My plan

1. Construct lots of possible synonyms

2. Sort out bad synonym pairs automatically

3. Ask lots of users if the rest of the pairs are good synonyms

4. Analyze the gradings done by the users and decide which pairs to keep

Step 1: Construct lots of possible synonyms

If we have access to a Swedish-English dictionary SE and an English-Swedish dictionary ES, try to translate each word to English and back again to Swedish

$\{(w,v) : \exists y : y \in SE(w) \land v \in ES(y)\}$ or $\{(w,v) : \exists y : y \in SE(w) \land y \in SE(v)\}$

616 000 word pairs were generated
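
A minimal sketch of this round-trip idea, assuming the two dictionaries are available as plain Python mappings from a word to a list of translations. The function name `candidate_pairs`, the toy entries, and the choice to treat pairs as unordered are illustrative assumptions, not the code actually used for Synlex.

```python
from collections import defaultdict


def candidate_pairs(se, es):
    """Return Swedish word pairs that share at least one English word.

    se: Swedish word -> English translations (the SE dictionary)
    es: English word -> Swedish translations (the ES dictionary)
    Pairs are returned as sorted tuples, i.e. treated as unordered
    (an assumption made for this sketch).
    """
    # For every English word y, collect all Swedish words reachable from it:
    # words v with y in SE(v), plus words v in ES(y).
    back = defaultdict(set)
    for w, translations in se.items():
        for y in translations:
            back[y].add(w)
    for y, translations in es.items():
        back[y].update(translations)

    pairs = set()
    for w, translations in se.items():
        for y in translations:
            for v in back[y]:
                if v != w:  # a word is not a candidate synonym of itself
                    pairs.add(tuple(sorted((w, v))))
    return pairs


# Toy dictionaries, invented for illustration only.
SE = {"klass": ["class", "grade"], "rang": ["rank", "grade"], "sort": ["kind"]}
ES = {"class": ["klass", "kategori"], "kind": ["slag", "sort"]}

for pair in sorted(candidate_pairs(SE, ES)):
    print(pair)
```

Many of the pairs produced this way are not synonyms at all, because an English word with several senses links unrelated Swedish words, which is why the later filtering and grading steps are needed.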

Step 2: Remove bad synonym pairs automatically

Use RI (Random Indexing) [Kanerva, Kristoferson, Holst 2000] to measure the distance between words represented in a large vector space

Keep pairs that have small enough distance in the vector space
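
A rough sketch of how such a filter could look, assuming a tokenized corpus is available as a list of sentences. Each word gets a sparse random index vector, the context vector of a word is the sum of the index vectors of its neighbours, and pairs are kept when the cosine similarity of their context vectors is high (equivalently, when the cosine distance is small). The dimension, window size, threshold and all function names are assumptions chosen for illustration, not the settings used for Synlex.

```python
import math
import random
from collections import defaultdict


def index_vector(dim, nonzero, rng):
    """Sparse ternary random vector, stored as {position: +1 or -1}."""
    positions = rng.sample(range(dim), nonzero)
    return {p: rng.choice((-1, 1)) for p in positions}


def context_vectors(sentences, dim=512, nonzero=8, window=2, seed=0):
    """Build one context vector per word by summing neighbours' index vectors."""
    rng = random.Random(seed)
    index = {}
    for sentence in sentences:
        for word in sentence:
            if word not in index:
                index[word] = index_vector(dim, nonzero, rng)

    context = defaultdict(lambda: [0.0] * dim)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    for p, sign in index[sentence[j]].items():
                        context[word][p] += sign
    return dict(context)


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def keep_close_pairs(pairs, vectors, min_similarity):
    """Keep pairs whose words both have vectors and are similar enough."""
    return [
        (w, v)
        for w, v in pairs
        if w in vectors and v in vectors
        and cosine(vectors[w], vectors[v]) >= min_similarity
    ]
```

Words that never occur in the corpus have no context vector, so in this sketch their pairs are dropped; a real system would have to decide whether to drop or keep such pairs.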

Step 3: Ask lots of users if the rest of the pairs are good synonyms

When a user sends a word to the Lexin dictionary, they receive the translation followed by a question like:

Are 'spread' and 'lengthen' synonyms? Answer using a scale from 0 to 5 where 0 means 'I don’t agree' and 5 means 'I do fully agree', or answer 'I don’t know'

Step 4: Analyzing the gradings done by the users

1.2 million gradings were made in less than 2 months

Grading statistics were analyzed on several occasions

Some users sent comments

More and more interesting gradings as time goes by

[Figure: chart of the share of gradings (0% to 60%) for each answer (0, 1, 2, 3, 4, 5, don't know), with one series per year: 2005, 2006, 2007]

Distribution of mean gradings of word pairs

[Figure: chart of the share of word pairs (0% to 40%) by mean grading (0 to 5 in steps of 0.5), with series for 2005 and 2006]

Some statistics (January 2008)

2.8 M user gradings done

75 000 pairs (graded ≥ 2) in dictionary

108 000 pairs suggested by users

62 000 unique pairs suggested

20 000 of them have been accepted

Example: Synonyms to klass (class)

5: rang (grade), rank (rank), slag (kind)

4: kategori (category), stånd (social class), årskurs (grade)

3: fack (sphere), grad (degree), grupp (group), kvalitet (quality), nivå (level), sort (sort), standard (standard), stil (style)

2: skikt (layer), storleksordning (magnitude), typ (type)

1: poäng (point), stadga (stability)

0: uppdrag (mission), utbilda (educate)

How to prevent abuse?

Many gradings of a word pair are needed before it’s considered to be good

The pair to be graded is randomly picked from a very large list

Word pairs suggested by users are spell checked before they are added to the very large list
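
The grading analysis and the "many gradings" safeguard can be sketched as follows, assuming each grading is stored as a (word, word, grade) record where the grade is 0 to 5, or `None` for "I don’t know". The cut-off of mean grade ≥ 2 follows the statistics slide; the minimum number of gradings (`min_count=5`) and the function name `accepted_pairs` are hypothetical values picked for this illustration.

```python
from collections import defaultdict


def accepted_pairs(gradings, min_mean=2.0, min_count=5):
    """Return {pair: mean grade} for pairs with enough, good enough gradings.

    gradings: iterable of (word1, word2, grade), grade in 0..5 or None
    min_mean: lowest mean grade to keep a pair (2, as on the statistics slide)
    min_count: gradings required before a pair is trusted (assumed value)
    """
    totals = defaultdict(lambda: [0, 0])  # pair -> [sum of grades, count]
    for w, v, grade in gradings:
        if grade is None:  # "I don't know" answers carry no grade
            continue
        entry = totals[tuple(sorted((w, v)))]
        entry[0] += grade
        entry[1] += 1

    accepted = {}
    for pair, (total, count) in totals.items():
        mean = total / count
        if count >= min_count and mean >= min_mean:
            accepted[pair] = mean
    return accepted
```

Requiring several independent gradings per pair means that a single user answering carelessly or maliciously cannot push a pair into the dictionary on their own.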

People's definition of synonymy

Exact meaning of 'synonym' wasn’t defined

Users will grade using their intuitive understanding of the concept of synonymy and the words in the pair

The produced dictionary will use the people's own definition of synonymy

Hopefully this is exactly what they want!

Links

www.dsso.se The large Swedish word list

www.nada.kth.se/stava Spell checker

lexin.nada.kth.se/synlex.html 75 000 synonyms

sv.wiktionary.org 50 000 word dictionary

www.thesauruslex.com Hyperlexicon