Knowledge Center for Processing Hebrew

Alon Itai – CS Technion

Tools for underrepresented languages

Computer tools and especially the Internet are Anglophile.

Search engines are not tooled for morphologically rich languages.

Search “dog” “dogs” “and dogs”

[Figure: screenshot of Hebrew search-engine results for the query כלב ("dog"). The result snippets contain the forms כלב ("dog"), כלבים ("dogs"), כלבי ("dogs of") and וכלב ("and dog").]
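The gap between exact matching and morphology-aware matching can be sketched as follows. This is a toy illustration only: the transliterations klb, klbym and wklbym (dog, dogs, and dogs), the particle and suffix lists, and the function names are assumptions made for the example, not part of any search engine or of the Center's tools.

```python
# Toy comparison of exact token matching with a crude morphology-aware match.
# PREFIXES and PLURAL_SUFFIXES are simplified assumptions, not a real analyzer.
PREFIXES = ("w", "h", "b", "l", "k", "m")
PLURAL_SUFFIXES = ("ym", "wt")


def exact_match(query, tokens):
    """Return only the tokens identical to the query string."""
    return [t for t in tokens if t == query]


def normalize(token, lexicon):
    """Strip an optional plural suffix and leading particles until a lexicon entry is found."""
    candidates = [token] + [
        token[: -len(s)] for s in PLURAL_SUFFIXES if token.endswith(s)
    ]
    for cand in candidates:
        stem = cand
        while stem:
            if stem in lexicon:
                return stem
            if stem[0] not in PREFIXES:
                break
            stem = stem[1:]
    return token  # unknown word: leave it unchanged


def morph_match(query, tokens, lexicon):
    """Return the tokens whose normalized form equals the query."""
    return [t for t in tokens if normalize(t, lexicon) == query]


lexicon = {"klb", "xtwl"}                      # hypothetical two-word lexicon
tokens = ["klb", "klbym", "wklbym", "xtwl"]
print(exact_match("klb", tokens))
print(morph_match("klb", tokens, lexicon))
```

Exact matching finds only the bare form, while the normalized match also finds the plural and the prefixed plural. A real system would need a full morphological analyzer rather than fixed affix lists.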

Email and chats do not cope well with unfamiliar alphabets.

People use (pidgin) English for communication, …

The local language is used less and less.

The problem

Because of the small number of speakers, there is little economic incentive for commercial companies to develop tools.

• Even when tools are available, they are not open source.
• Tools developed at universities are not fit for general use: they are not robust enough, have no standard interface, and lack documentation.

Duplication of Effort

Every researcher has to redevelop her own tools before conducting original research.

For example, there are many morphological analyzers for Hebrew:
1. Choueka and Shapira 1964
2. Ornan 1987, Lavie et al. 1988
3. Bentur et al. 1992
4. Segal 1999
5. HSPELL
6. Yona and Wintner 2005

The Knowledge Center

In 2003, the Israeli Ministry of Science and Technology established a Knowledge Center for Processing Hebrew.

Its aim is to develop products (software and databases) for processing Hebrew and to make them available to the public, both in academia and in industry.

Researchers from four universities are involved in the Center's activities.

The researchers

• Yoad Winter (Technion)
• Shuly Wintner (Haifa University)
• Michael Elhadad (Ben Gurion University)
• Arnon Cohen (Ben Gurion University)
• Yoram Singer (Hebrew University)
• Eli Shamir (Hebrew University)
• Alon Itai (Technion)

The model

The ministry provides initial funds.

The Center should be self-sustainable: it should finance itself by selling products.

The problems:
• The market is too small; had it been large, there would have been no need for the Center.
• Selling products contradicts our philosophy of open research and open code.

Licensing Policy

• Available under the GPL (GNU General Public License).
• You get it for free if all products derived from it are also under the GPL.
• Payment is required only for special services.
• A non-exclusive license can be obtained for commercial use.

EXAMPLE

<item id="17580" script="formal" transliterated="bwqr" undotted="בוקר" dotted="ר fקgֹּב">

</noun>

</item>

All products are represented in XML.
• Readable both by machines and by humans
• Enables using off-the-shelf tools for on-screen presentation and validation

Info for the morphological parser

XML (2)

Facilitates interface between tools:

For example, the output of the morphological analyzer is the input for the morphological disambiguator.

Thus one can match different morphological analyzers with different disambiguators and compare their results
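A minimal sketch of this plug-in idea is given below. The XML layout (corpus, token and analysis elements with surface, pos and prob attributes), the part-of-speech labels and the probabilities are all invented for the illustration and are not the Center's actual schema; the two "disambiguators" are deliberately trivial stand-ins.

```python
import xml.etree.ElementTree as ET

# Hypothetical analyzer output: every token carries all of its candidate analyses.
# Element names, attributes and numbers are assumptions made for this sketch.
ANALYZER_OUTPUT = """
<corpus>
  <token surface="bwqr">
    <analysis id="1" pos="verb" prob="0.3"/>
    <analysis id="2" pos="noun" prob="0.7"/>
  </token>
  <token surface="wbbyt">
    <analysis id="1" pos="noun" prob="1.0"/>
  </token>
</corpus>
"""


def most_probable(token):
    """Stand-in disambiguator A: pick the analysis with the highest prob attribute."""
    return max(token.findall("analysis"), key=lambda a: float(a.get("prob")))


def first_listed(token):
    """Stand-in disambiguator B: pick whichever analysis the analyzer listed first."""
    return token.findall("analysis")[0]


def disambiguate(xml_text, choose):
    """Run any disambiguator over any analyzer output that follows the shared format."""
    root = ET.fromstring(xml_text)
    return [(tok.get("surface"), choose(tok).get("pos")) for tok in root.iter("token")]


print(disambiguate(ANALYZER_OUTPUT, most_probable))
print(disambiguate(ANALYZER_OUTPUT, first_listed))
```

Because both functions read the same XML, swapping the analyzer or the disambiguator requires no change to the other component, and their outputs can be compared token by token.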

Products

• Morphological analyzers
• Morphological disambiguators
• Lexicon
• Corpora
• Speech database
• Tools for editing lexicons and tagging corpora
• PR: forum, …

The lexicon by part of speech

Part of speech    Entries    Part of speech    Entries
noun              10332      preposition       100
verb              4485       conjunction       62
Proper Name       4227       pronoun           60
adjective         1612       interjection      40
adverb            352        interrogative     9
quantifier        132        negation          6

Total: 21,417

GUI for editing the lexicon

Morphological disambiguators

Roy Bar-Haim constructed an HMM-based parser that partitions each word in a corpus into morphemes (success rate 96%).

Erel Segal combined a Brill-like method with a priori occurrence probabilities.

Meni Adler used an HMM on whole words.

All three disambiguators are available at the Center.
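How an HMM can choose among the analyses proposed for each word may be sketched with a generic Viterbi search. This is a textbook sketch, not the implementation of any of the three disambiguators; the tags, the probabilities and the placeholder token PRON are invented for the example.

```python
import math


def viterbi(candidates, start, trans, emit):
    """Pick the most probable tag sequence.

    candidates: list of (word, [possible tags]) as proposed by an analyzer.
    start, trans, emit: probability tables of a toy HMM (all values assumed).
    """
    word, tags = candidates[0]
    # best[tag] = (log-probability of the best path ending in tag, that path)
    best = {t: (math.log(start[t]) + math.log(emit[(word, t)]), [t]) for t in tags}
    for word, tags in candidates[1:]:
        new_best = {}
        for t in tags:
            # Best predecessor for tag t, scored by path probability times transition.
            score, path = max(
                (prev_score + math.log(trans[(p, t)]), prev_path)
                for p, (prev_score, prev_path) in best.items()
            )
            new_best[t] = (score + math.log(emit[(word, t)]), path + [t])
        best = new_best
    return max(best.values())[1]


# PRON stands in for an unambiguous pronoun token; šbth is ambiguous.
candidates = [("PRON", ["pronoun"]), ("šbth", ["verb", "noun"])]
start = {"pronoun": 1.0}
trans = {("pronoun", "verb"): 0.8, ("pronoun", "noun"): 0.2}
emit = {("PRON", "pronoun"): 1.0, ("šbth", "verb"): 0.5, ("šbth", "noun"): 0.5}
print(viterbi(candidates, start, trans, emit))
```

With these made-up numbers the word-level evidence is tied, so the context (a preceding pronoun) decides in favour of the verb reading.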

Corpora

Total tokens    Unique tokens    Corpus
11,062,232      319,666
11,216,867      304,160          Arutz 7
1,300,326       166,780          Sha’ar la-matkhil (dotted)
17,732,122      262,338          Knesset

Corpora (2)

6000 sentences of manually tagged corpus (12,000 tokens).

Tree bank

6000 syntactically parsed sentences. Used for automatic parsing.

Conclusions

The Center is an example of cooperation between researchers in several universities.

Many users have downloaded the products.

10 companies have purchased licenses.

Conclusions (2)

• Money is running out, …
• The model requires money, experts, and commitment.
• Not suitable for languages with very few speakers, or for poor communities.

Modern Hebrew

• Official language of the State of Israel
• Spoken by 7 million people
• Related to, but linguistically distinct from, Biblical Hebrew
• Morphologically rich

Semitic Word Formation

root + pattern → word

root    CaCaC                 yiCCoC
ktb     katab (he wrote)      yiktob (he will write)
šbr     šabar (he broke)      yišbor (he will break)
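The combination of a root with a pattern can be sketched in a few lines, assuming that each C in the pattern is a slot for the next root consonant. The function name interdigitate is chosen for the illustration.

```python
import unicodedata


def interdigitate(root, pattern):
    """Fill the C slots of a pattern with the consonants of the root, in order."""
    # Normalize so that a letter such as š counts as a single consonant.
    root = unicodedata.normalize("NFC", root)
    if pattern.count("C") != len(root):
        raise ValueError("the pattern needs exactly one C slot per root consonant")
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)


for root in ("ktb", "šbr"):
    for pattern in ("CaCaC", "yiCCoC"):
        print(root, "+", pattern, "->", interdigitate(root, pattern))
```

The sketch ignores the sound changes that real Hebrew forms undergo; it only shows the slot-filling idea behind the examples in the table.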

Writing System

• Most vowels are omitted.
• Particles are prepended to words.

Example:

h – definite article

b – preposition (in)

w – conjunction (and)

wbbyt = w + b + ha + byt

and in the house (the article h is not written after the preposition b)
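Peeling particles off the front of a written token can be sketched as below. The particle table holds only the three particles listed above, and the one-word lexicon and the function name candidate_splits are assumptions made for the example.

```python
# Particles taken from the slide: conjunction w, preposition b, definite article h.
PARTICLES = {"w": "and", "b": "in", "h": "the"}


def candidate_splits(token, lexicon):
    """Return every (particles, stem) reading whose stem is found in the lexicon."""
    results = []

    def walk(prefixes, rest):
        if rest in lexicon:
            results.append((tuple(prefixes), rest))
        # Try to peel one more particle, as long as something remains for the stem.
        if len(rest) > 1 and rest[0] in PARTICLES:
            walk(prefixes + [rest[0]], rest[1:])

    walk([], token)
    return results


lexicon = {"byt"}  # hypothetical lexicon containing only "house"
for particles, stem in candidate_splits("wbbyt", lexicon):
    gloss = " ".join(PARTICLES[p] for p in particles)
    print("+".join(particles + (stem,)), "=", gloss, stem)
```

The split recovers w + b + byt from the written form. Because the article is not written after b, a surface split alone cannot tell "and in a house" from "and in the house"; that decision is left to later analysis.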

Morphological Ambiguity

Most words are morphologically ambiguous.

Example: šbth (שבתה)

1. šavta = šbt + CaCCa = stopped working

2. šavta = šbh + CaCCa = took prisoner

3. šabatah = her Saturday

4. še-b-te = that in tea

5. še-b-ha-te = that in the tea

6. še-bit-h = that her daughter
