Post on 19-Jan-2016
Knowledge Center for Processing Hebrew
Alon Itai – CS Technion
Tools for underrepresented languages
Computer tools and especially the Internet are Anglophile.
Search engines are not tooled for morphologically rich languages.
Search “dog” “dogs” “and dogs”
[Screenshot: Hebrew search-engine result pages for queries on forms of כלב ("dog"), including כלבים ("dogs") and וכלב ("and a dog").]
Tools for underrepresented languages (2)
Email and chats do not cope well with strange alphabets, so people
use (pidgin) English for communication, …
The local language is used less and less.
The problem
Because of the small number of speakers, there is little economic incentive for commercial companies to develop tools.
Even when tools are available, they are not open source.
Tools developed at universities are not fit for general use:
• not robust enough
• no standard interface
• lack of documentation
Duplication of Effort
Every researcher has to redevelop her own tools before conducting original research.
For example, in Hebrew there are many morphological analyzers:
1. Choueka and Shapira 1964
2. Ornan 1987, Lavie et al. 1988
3. Bentur et al. 1992
4. Segal 1999
5. HSPELL
6. Yona and Wintner 2005
The Knowledge Center
In 2003, the Israeli Ministry of Science and Technology established a Knowledge Center for Processing Hebrew.
Its aim is to develop products (software and databases) for processing Hebrew and to make them available to the public, both in academia and in industry.
Researchers from four universities are involved in the Center's activities.
The researchers
Yoad Winter (Technion)
Shuly Wintner (Haifa University)
Michael Elhadad (Ben Gurion University)
Arnon Cohen (Ben Gurion University)
Yoram Singer (Hebrew University)
Eli Shamir (Hebrew University)
Alon Itai (Technion)
The model
The ministry provides initial funds.
The Center should be self-sustainable: it should finance itself by selling products.
The problems:
• The market is too small; had it been large, there would have been no need for the Center.
• Selling products contradicts our philosophy of open research and open code.
Licensing Policy
Available under the GPL (GNU General Public License).
You get it for free if all products derived from it are also under the GPL.
Payments only for special services.
A non-exclusive license is available for commercial use.
XML
EXAMPLE
<item id="17580" script="formal" transliterated="bwqr" undotted="בוקר" dotted="בֹּקֶר">
  <noun gender="masculine" number="singular" plural="im">
    <replace gender="masculine" number="plural" script="formal" transliterated="bqarim" undotted="בקרים"/>
  </noun>
</item>
All products are represented by XML.
• Readable both by machines and by humans
• Enables using off-the-shelf tools for on-screen presentation and validation
Info for the morphological parser
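A minimal sketch of how a tool might read such an entry with Python's standard xml.etree.ElementTree module. The entry is the example item written with plain ASCII quotes and without the dotted attribute. The helper name read_item and the dictionary keys are hypothetical, chosen for illustration only; they are not part of the Center's software.

```python
import xml.etree.ElementTree as ET

# The example lexicon item, with plain ASCII quotes.
# The dotted form is left out to keep the sketch short.
ENTRY = """
<item id="17580" script="formal" transliterated="bwqr" undotted="בוקר">
  <noun gender="masculine" number="singular" plural="im">
    <replace gender="masculine" number="plural" script="formal"
             transliterated="bqarim" undotted="בקרים"/>
  </noun>
</item>
"""

def read_item(xml_text):
    """Return selected fields of one lexicon item as a plain dictionary."""
    item = ET.fromstring(xml_text.strip())
    noun = item.find("noun")
    replace = noun.find("replace")
    return {
        "id": item.get("id"),
        "lemma": item.get("transliterated"),
        "gender": noun.get("gender"),
        "plural_suffix": noun.get("plural"),
        # Assumption: <replace> lists an explicit plural form of the noun.
        "plural_form": replace.get("transliterated"),
    }

entry = read_item(ENTRY)
print(entry["lemma"], "->", entry["plural_form"])
```

Because the format is ordinary XML, the same entry can be checked or displayed with any off-the-shelf XML tool.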
XML (2)
Facilitates interface between tools:
For example, the output of the morphological analyzer is the input for the morphological disambiguator.
Thus one can match different morphological analyzers with different disambiguators and compare their results
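The mix-and-match idea can be sketched as follows. The real tools exchange XML; here plain Python lists stand in for that shared format, and both analyzers and both disambiguators are invented toy functions, not the Center's tools.

```python
from itertools import product

# Hypothetical stand-ins for real tools. Each analyzer maps a token to a
# list of candidate analyses; each disambiguator picks one candidate.
def analyzer_a(token):
    return [token + "/noun", token + "/verb"]

def analyzer_b(token):
    return [token + "/verb", token + "/noun", token + "/adjective"]

def pick_first(candidates):
    return candidates[0]

def pick_shortest(candidates):
    return min(candidates, key=len)

def run(analyzer, disambiguator, tokens):
    """Chain one analyzer with one disambiguator over a token list."""
    return [disambiguator(analyzer(tok)) for tok in tokens]

analyzers = {"A": analyzer_a, "B": analyzer_b}
disambiguators = {"first": pick_first, "shortest": pick_shortest}

tokens = ["bwqr", "byt"]
# Every analyzer is paired with every disambiguator.
results = {
    (a_name, d_name): run(a, d, tokens)
    for (a_name, a), (d_name, d) in product(analyzers.items(), disambiguators.items())
}
for pair, output in results.items():
    print(pair, output)
```

Any analyzer can be paired with any disambiguator because both sides agree on one intermediate representation, which is the role XML plays between the Center's tools.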
Products
Morphological analyzers
Morphological disambiguators
Lexicon
Corpora
Speech database
Tools for editing lexicons and tagging corpora
PR: forum, …
The lexicon by part of speech
Part of speech   Entries    Part of speech   Entries
noun             10,332     preposition          100
verb              4,485     conjunction           62
proper name       4,227     pronoun               60
adjective         1,612     interjection          40
adverb              352     interrogative          9
quantifier          132     negation               6
Total: 21,417
GUI for editing the lexicon
Morphological disambiguators
Roy Bar-Haim constructed an HMM-based parser that partitions each word in a corpus into morphemes (success rate 96%).
Erel Segal combined a Brill-like method with a priori occurrence probabilities.
Meni Adler used an HMM on whole words.
All three disambiguators are available at the Center.
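As a rough illustration of HMM-based disambiguation in general (not the implementation of any of the three tools), the sketch below runs the Viterbi algorithm over the candidate tags of each whole word. The tag sets and every probability are invented for the example.

```python
import math

def viterbi(candidates, start, trans, emit):
    """Pick the most probable tag sequence.

    candidates: list of (word, [possible tags]) pairs
    start[t]: probability that a sentence starts with tag t
    trans[(s, t)]: probability of tag t following tag s
    emit[(t, w)]: probability of word w given tag t
    """
    word, tags = candidates[0]
    # best[t] = (log-probability, tag path) of the best path ending in t
    best = {t: (math.log(start[t] * emit[(t, word)]), [t]) for t in tags}
    for word, tags in candidates[1:]:
        step = {}
        for t in tags:
            score, path = max(
                (prev_score + math.log(trans[(s, t)] * emit[(t, word)]), prev_path)
                for s, (prev_score, prev_path) in best.items()
            )
            step[t] = (score, path + [t])
        best = step
    return max(best.values())[1]

# Toy model: all numbers below are made up for illustration.
candidates = [("bwqr", ["noun"]), ("šbth", ["verb", "noun"])]
start = {"noun": 1.0}
trans = {("noun", "verb"): 0.6, ("noun", "noun"): 0.4}
emit = {("noun", "bwqr"): 0.01, ("verb", "šbth"): 0.01, ("noun", "šbth"): 0.01}

best_tags = viterbi(candidates, start, trans, emit)
print(best_tags)
```

With these toy numbers the ambiguous second word is tagged as a verb, because the invented noun-to-verb transition is more probable than noun-to-noun.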
Corpora
Corpus                        Total tokens   Unique tokens
(name missing)                  11,062,232         319,666
Arutz 7                         11,216,867         304,160
Sha’ar la-matkhil (dotted)       1,300,326         166,780
Knesset                         17,732,122         262,338
Corpora (2)
A manually tagged corpus of 6000 sentences (12,000 tokens).
Tree bank
6000 syntactically parsed sentences. Used for automatic parsing.
Conclusions
The Center is an example of cooperation between researchers in several universities.
Many users have downloaded the products.
10 companies have purchased licenses.
Conclusions (2)
Money is running out, …
The model requires money, experts, and commitment.
Not suitable for languages with very few speakers, or for poor communities.
Modern Hebrew
Official language of the State of Israel
Spoken by 7 M people
Related to, but linguistically distinct from, Biblical Hebrew
Morphologically rich
Semitic Word Formation
root + pattern → word

root    pattern CaCaC        pattern yiCCoC
ktb     katab (he wrote)     yiktob (he will write)
šbr     šabar (he broke)     yišbor (he will break)
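The root-plus-pattern scheme can be sketched in a few lines: each C in the pattern is a slot for the next root consonant. The function name apply_pattern is a hypothetical helper, and the sketch ignores the phonological adjustments that real Hebrew word formation involves.

```python
def apply_pattern(root, pattern):
    """Fill the C slots of a pattern with the consonants of a root, in order."""
    if pattern.count("C") != len(root):
        raise ValueError("pattern slots and root consonants must match")
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)

for root in ("ktb", "šbr"):
    print(root, apply_pattern(root, "CaCaC"), apply_pattern(root, "yiCCoC"))
# prints:
# ktb katab yiktob
# šbr šabar yišbor
```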
Writing System
Most vowels are omitted.
Particles are prepended to words.
Example:
h – definite article
b – preposition (in)
w – conjunction (and)
wbbyt = w + b + ha + byt
"and in the house"
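A minimal sketch of stripping prepended particles from a written word. It assumes a toy one-word lexicon and only the three particles listed above, ignores ordering constraints between particles, and does not model the definite article that is left unwritten after b (the "ha" in the example). All names in the code are hypothetical.

```python
PARTICLES = {"w": "and", "b": "in", "h": "the"}
LEXICON = {"byt": "house"}  # toy lexicon, an assumption for this sketch

def segmentations(word, lexicon=LEXICON):
    """Return every split of word into prefix particles plus a lexicon stem."""
    results = []

    def strip(rest, prefixes):
        # Accept the remainder if it is a known stem.
        if rest in lexicon:
            results.append(prefixes + [rest])
        # Also try peeling off one more particle.
        if rest and rest[0] in PARTICLES:
            strip(rest[1:], prefixes + [rest[0]])

    strip(word, [])
    return results

print(segmentations("wbbyt"))
# prints: [['w', 'b', 'byt']]
```

With a larger lexicon the same word can receive several splits, which is one source of the ambiguity that a disambiguator has to resolve.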
Morphological Ambiguity
Most words are morphologically ambiguous.
Example: šbth שבתה
1. šavta = šbt + CaCCa = stopped working
2. šavta = šbh + CaCCa = took prisoner
3. šabatah = her Saturday
4. še-b-te = that in tea
5. še-b-ha-te = that in the tea
6. še-bit-h = that her daughter
…