Knowledge Center for Processing Hebrew

26
Knowledge Center for Processing Hebrew Alon Itai – CS Technion

description

Knowledge Center for Processing Hebrew. Alon Itai – CS Technion. Tools for underrepresented languages. Computer tools and especially the Internet are Anglophile. Search engines are not tooled for morphologically rich languages. Search “ dog ” “ dogs ” “ and dogs ”. כלבים מאולפים מחפשים בית - PowerPoint PPT Presentation

Transcript of Knowledge Center for Processing Hebrew

Page 1: Knowledge Center  for Processing Hebrew

Knowledge Center for Processing Hebrew

Alon Itai – CS Technion

Page 2: Knowledge Center  for Processing Hebrew

Tools for underrepresented languages

Computer tools and especially the Internet are Anglophile.

Search engines are not tooled for morphologically rich languages.

Page 3: Knowledge Center  for Processing Hebrew

Search “dog” “dogs” “and dogs”

כלבויקיפדיה - כלב כלבים מאולפים מחפשים ביתיונק כלב((כלב | כלביםזאב ביתיכלב – הבית מכונה בלשון המדעכלב אוגר זהב עמותת SOS מתאיםכלבחיות - בחירת כלב - לוח חיות מחמד - כלבים חתולים דגים תוכים לאימוץ ומסירהאתר המציע שידוכים בין גזעים, בייביסיטרים, כלב | כלבים

.תזונה וטיפוח, וטרינרים, פנסיונים, מאלפים ולוח מודעותDog ,אתר הכלבים מכיל הרבה מידע, מאמרים, קורסים

וכל הקשור בהםכלביםתמונות וקטעי וידאו של dog תמונת החודש · הכלב והחוק · רפואה וטיפול · כלביםגזעי

כלבי הצלה · קטעי וידאו · · · קורסים · מאמרים · לוח מודעות...תמונת השנה · פינת האימוץ

מאולפים מחפשים בית כלביםכלביםרוני אילוף

כלביםאתרי קטגורית !הב-הב אתר חיות המחמד של ישראל

כלביםקובי חזן אילוף כלביםהיחידה המיוחדת לאילוף

זולו משחקים פאזלים - משחק לגיל הרך - פאזל חתול עם כלבוכלבעל אלמנה

PETNET.co.il - רועים בלגיוכלבניופאונדלנד, כלבי רועים נחייהוכלבליווי, עזרת זולת רפואית

Page 4: Knowledge Center  for Processing Hebrew

Tools for underrepresented languages.

Computer tools and especially the Internet are Anglophile.

Search engines are not tooled for morphologically rich languages.

Email and chats do not cope well with strange alphabets

use (pidgin( English for communication,…

The local language is used less and less.

Page 5: Knowledge Center  for Processing Hebrew

The problem

Because of the small number of speakers, there is little economic incentive for commercial companies to develop tools.

Even when tools are available – no open source Tools developed at Universities are not fit for

general use:not robust enough no standard interfacelack of documentation

Page 6: Knowledge Center  for Processing Hebrew

Duplication of Effort Every researcher has to redevelop her own

tools, before conducting original research For example: In Hebrew, there are many

morphological analyzers:1. Choueka and Shapira 1964,2. Ornan 1987, Lavie et al. 1988, 3. Bentur et al. 1992, 4. Segal 1999, 5. HSPELL6. Yona and Wintner 2005

Page 7: Knowledge Center  for Processing Hebrew

The Knowledge Center

In 2003, the Israeli Ministry of Science and Technology established a Knowledge Center for Processing Hebrew.

Its aim to develop products (software and databases( for processing Hebrew and make them available to the public, both in academia and industry.

Researchers from four universities are involved in the Center's activities.

Page 8: Knowledge Center  for Processing Hebrew

The researchers

Yoad Winter (Technion(, Shuly Wintner (Haifa University(, Michael Elhadad (Ben Gurion University(, Arnon Cohen (Ben Gurion University(, Yoram Singer (Hebrew University( Eli Shamir (Hebrew University( Alon Itai (Technion(

Page 9: Knowledge Center  for Processing Hebrew

The model

The ministry provides initial funds. The Center should be self-sustainable – it should

finance itself by selling products.

The problems: The market is too small, had it been large then

there would have been no need for the center. Contradicts our philosophy of open research and

open code.

Page 10: Knowledge Center  for Processing Hebrew

Licensing Policy

Available under GPL – Gnu Public License. You get if for free if all products derived from it are also under GPL.

Payments only for special services. Can get a non-exclusive license for

commercial use.

Page 11: Knowledge Center  for Processing Hebrew

XML

EXAMPLE

-<item id=“17580” script=“formal” transliterated=“bwqr” undotted=“בוקר“ dotted=“ר fקgֹּב“ >

<noun gender=“masculine” number=“singular” plural=“im”>

<replace gender=“masculine” number=“plural” script=“formal” transliterated=“bqarim” undotted=“בקרים“/>

</noun>

</item>

All products are represented by XML.•Readable both by machines and by humans•Enables using off-shelf tools for on screen presentation and validation

Info for the morphological parser

Page 12: Knowledge Center  for Processing Hebrew

XML (2(

Facilitates interface between tools:

For example, the output of the morphological analyzer is the input for the morphological disambiguator.

Thus one can match different morphological analyzers with different disambiguators and compare their results

Page 13: Knowledge Center  for Processing Hebrew

Products

Morphological analyzers Morphological disambiguators Lexicon Corpora Speech data base Tools for editing lexicons and tagging

corpora. PR: forum,…

Page 14: Knowledge Center  for Processing Hebrew

The lexicon by part of speech

noun10332preposition100

verb4485conjunction62Proper Name4227pronoun60

adjective1612interjection40

adverb352interrogative9

quantifier132negation6

Total : 21,417

Page 15: Knowledge Center  for Processing Hebrew

GUI for editing the lexicon

Page 16: Knowledge Center  for Processing Hebrew
Page 17: Knowledge Center  for Processing Hebrew

Morphological disambiguators

Roy Bar-Haim constructed a HMM-based parser which partitions each word in a corpus into morphemes – success rate 96%.

Erel Segal combined a Brill-like method with a priori occurrence probabilities .

Meni Adler used HMM on whole words. All three disambiguators are available at

the Center.

Page 18: Knowledge Center  for Processing Hebrew

Corpora

Size

Unique tokens total

קורפוס

11,062,232319,666

11,216,867304,160 7Arutz

1,300,326166,780 Sha’ar la-matkhil

(dotted(

17,732,122 262,338 Knesset

Page 19: Knowledge Center  for Processing Hebrew

Corpora (2(

6000 sentences of manually tagged corpus (12,000 tokens(.

Page 20: Knowledge Center  for Processing Hebrew

Tree bank

6000 syntactically parsed sentences. Used for automatic parsing.

Page 21: Knowledge Center  for Processing Hebrew

Conclusions

The Center is an example of cooperation between researchers in several universities.

Many users have downloaded the products.

10 companies have purchased licenses.

Page 22: Knowledge Center  for Processing Hebrew

Conclusions (2(

Money is running out, … The model requires money, experts, and

commitment. Not suitable for languages with very few

speakers, or for poor communities.

Page 23: Knowledge Center  for Processing Hebrew

Modern Hebrew

Official Language of the State of Israel Spoken by 7 M people Related, but linguistically distinct, from Biblical

Hebrew. Morphologically rich

Page 24: Knowledge Center  for Processing Hebrew

Semitic Word Formation

root + pattern word

rootpattern

CaCaC yiCCoC

ktb

šbr

katab (he wrote( yiktob (he will write(

šabar (he broke( yišbor (he will break(

Page 25: Knowledge Center  for Processing Hebrew

Writing System

Most vowels are omitted Particles are prepended to words,

Example:

h – definite article,

b – preposition (in(

w – conjunction (and(

wbbyt = w + b + ha +byt

and in the house

Page 26: Knowledge Center  for Processing Hebrew

Morphological Ambiguity

Most words are morphologically ambiguous Example: šbth שבתה

1. šavta = šbt + CaCCa = stopped working

2. šavta = šbh + CaCCa = took prisoner

3. šabatah = her Saturday

4. še-b-te = that in tea

5. še-b-ha-te = that in the tea

6. še-bit-h = that her daughter