Overview of the Language Technologies Institute and AVENUE Project Jaime Carbonell, Director March 2, 2002

Page 1:

Overview of the Language Technologies Institute

and AVENUE Project

Jaime Carbonell, Director

March 2, 2002

Page 2:

School of Computer Science at Carnegie Mellon University

• Computer Science Department (theory, systems)

• Robotics Institute (space, industry, medical)

• Language Technologies Institute (MT, speech, IR)

• Human-Computer Interaction Inst. (Ergonomics)

• Institute for Software Research Int. (SE)

• Center for Automated Learning & Disc (DM)

• Entertainment Technologies (Animation, graphics)

Page 3:

Language Technologies Institute

• Founded in 1986 as the Center for Machine Translation (CMT).

• Became the Language Technologies Institute in 1996, unifying CMT and the Computational Linguistics program.

• Current Size: 110 FTEs

– 18 Faculty

– 22 Staff

– 60 Graduate Students (45 PhD, 15 MLT)

– 10 Visiting Scholars

Page 4:

Bill of Rights

• Get the right information

• To the right people

• At the right time

• On the right medium

• In the right language

• With the right level of detail

Page 5:

“The Right Information”

• Find the right papers, web-pages, …

– Language modeling for IR (Lafferty, Callan)

– Translingual IR (Yang, Carbonell, Brown)

– Distributed IR (Callan)

• Seek Novelty (Carbonell, Yang, …)

– Avoid massive redundancy

– Detect new events in streaming data

Page 6:

“…to the Right People”

• Text Categorization

– Multi-class classifiers by topic (Yang)

– Boosting for genre learning (Carbonell)

• Filtering & Routing

– Topic tracking in streaming data (Yang)

– TREC filtering/routing (Callan, Yang)

Page 7:

“…at the Right Time”

• I.e. when the information is needed

• Anticipatory analysis

– Helpful info without being asked

• Context-aware learning

– Interactivity with user

– Utility theory (when to ask, when to give new or deeper info, when to back off)

(We have not yet taken up this challenge)

Page 8:

“…on the Right Medium”

• Speech Recognition

– SPHINX (Reddy, Rudnicky, Rosenfeld, …)

– JANUS (Waibel, Schultz, …)

• Speech Synthesis

– Festival (Black, Lenzo)

• Handwriting & Gesture Recognition

– ISL (Waibel, J. Yang)

• Multimedia Integration (CSD)

– Informedia (Wactlar, Hauptmann, …)

Page 9:

“… in the Right Language”

• High-Accuracy Interlingual MT

– KANT (Nyberg, Mitamura)

• Parallel Corpus-Trainable MT

– Statistical MT (Lafferty, Vogel)

– Example-Based MT (Brown, Carbonell)

– AVENUE Instructible MT (Levin, Lavie, Carbonell)

• Speech-to-speech MT

– JANUS/DIPLOMAT/AVENUE (Waibel, Frederking, Levin, Schultz, Vogel, Lafferty, Black, …)

Page 10:

“…at the Right Level of Detail”

• Multidocument Summarization (Carbonell, Waibel, Yang, …)

• Question Answering (Carbonell, Callan, Nyberg, Mitamura, Lavie, …)

– New thrust (JAVELIN project)

– Combines Q-analysis, IR, extraction, planning, user-feedback, utility analysis, answer synthesis, …

Page 11:

We also Engage in:

• Tutoring Systems (Eskenazi, Callan)

• Linguistic Analysis (Levin, Mitamura, …)

• Robust Parsing Algorithms (Lavie, …)

• Interface & communication language design (Rosenfeld, Waibel, Rudnicky)

• Complex System Design (Nyberg, Callan)

• Machine Learning (Carbonell, Lafferty, Yang, Rosenfeld, Lavie, …)

Page 12:

How we do it at LTI

• Data-driven methods

– Statistical learning

– Corpora-based

• Examples:

– Statistical MT

– Example-based MT

– Text categorization

– Novelty detection

– Translingual IR

• Knowledge-based

– Symbolic learning

– Linguistic analysis

– Knowledge represent.

• Examples:

– Interlingual MT

– Parsing & generation

– Discourse modeling

– Language tutoring

Page 13:

Hot Research Topics

• Automated Q/A from web/text (JAVELIN)

• Endangered Language MT (AVENUE)

• Novelty detection and tracking (TDT)

• Theoretical foundations of language modeling and knowledge discovery

(All require multi-discipline approach.)

Page 14:

Educational Programs at LTI

• PhD Program

– 45 PhD students, all research areas of LTI

– Individual and joint advisorships

– “Marriage” process in mid-September to match faculty/projects with new students

– Years 1-2: 50% research, 50% courses

– Years 3-N: 100% research (target: N=5)

– Semi-annual student evaluations

Page 15:

Education at LTI (II)

• MLT Program (1-2 years)

– Courses are more central

– 50% on Project/research work (if funded)

– Many MLTs apply for PhD admission

• CALL Masters (1 year)

– New program joint with Modern Languages

• Certificate program (1 semester)

Page 16:

The AVENUE Project: Machine Translation and Language Tools for Minority Languages

Jaime Carbonell, Lori Levin, Alon Lavie, Tanja Schultz, Eric Petersen, Kathrin Probst, Christian Monson, …

Page 17:

Machine Translation of Indigenous Languages

• Policy makers have access to information about indigenous people.

– Epidemics, crop failures, etc.

• Indigenous people can participate in

– Health care

– Education

– Government

– Internet

without giving up their languages.

Page 18:

History of AVENUE

• Arose from a series of joint workshops of NSF and OAS.

• Workshop recommendations:

– Create multinational projects using information technology to:

• provide immediate benefits to governments and citizens

• develop critical infrastructure for communication and collaborative research

– training researchers and engineers

– advancing science and technology

Page 19:

Resources for MT

• People who speak the language.

• Linguists who speak the language.

• Computational linguists who speak the language.

• Text on paper.

• Text on line.

• Comparable text on paper or on line.

• Parallel text on paper or on line.

• Annotated text (part of speech, morphology, etc.)

• Dictionaries (mono-lingual or bilingual) on paper or on line.

• Recordings of spoken language.

• Recordings of spoken language that are transcribed.

• Etc.

Page 20:

MT for Indigenous Languages

• Minimal amount of parallel text

• Possibly competing standards for orthography/spelling

• Maybe not so many trained linguists

• Access to native informants possible

• Need to minimize development time and cost

Page 21:

Two Technical Approaches

• Generalized EBMT

• Parallel text 50K-2MB (uncontrolled corpus)

• Rapid implementation

• Proven for major L’s with reduced data

• Transfer-rule learning

• Elicitation (controlled) corpus to extract grammatical properties

• Seeded version-space learning

Page 22:

Types of Machine Translation

[Diagram: translation paths from Source (Arabic) to Target (English). Labels: Interlingua; Syntactic Parsing; Semantic Analysis; Sentence Planning; Text Generation; Transfer Rules; Direct: SMT, EBMT]

Page 23:

Multi-Engine Machine Translation

• MT Systems have different strengths

– Rapidly adaptable: Statistical, example-based

– Good grammar: Rule-Based (linguistic) MT

– High precision in narrow domains: KBMT

– Minority Language MT: Learnable from informant

• Combine results of parallel-invoked MT

– Select best of multiple translations

– Selection based on optimizing combination of:

• Target language joint-exponential model

• Confidence scores of individual MT engines
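The selection step above might be sketched roughly as follows. This is only an illustration under stated assumptions: a toy unigram model stands in for the target language model, the two scores are combined as a weighted sum of log values, and the names `Candidate`, `lm_logprob`, `select_best`, and the weight `lam` are hypothetical. The slide does not say how the combination is actually optimized.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    engine: str        # which MT engine produced this output
    text: str          # candidate translation
    confidence: float  # engine's self-reported confidence in (0, 1]


def lm_logprob(text, unigram_probs, floor=1e-6):
    """Average per-word log-probability under a toy unigram model."""
    words = text.lower().split()
    if not words:
        return float("-inf")
    return sum(math.log(unigram_probs.get(w, floor)) for w in words) / len(words)


def select_best(candidates, unigram_probs, lam=0.5):
    """Return the candidate with the best weighted LM + confidence score.

    lam is an assumed interpolation weight, not a value from the slides.
    """
    def score(c):
        return (lam * lm_logprob(c.text, unigram_probs)
                + (1 - lam) * math.log(c.confidence))
    return max(candidates, key=score)
```

With equal engine confidences, the candidate whose words are more probable under the target-language model wins; raising one engine's confidence can shift the choice.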

Page 24:

Illustration of Multi-Engine MT

Source: El punto de descarge | se cumplirá en | el puente Agua Fria

Translation 1: The drop-off point | will comply with | The cold Bridgewater

Translation 2: The discharge point | will self comply in | the “Agua Fria” bridge

Translation 3: Unload of the point | will take place at | the cold water of bridge

Page 25:

EBMT Example

English: I would like to meet her.

Mapudungun: Ayükefun trawüael fey engu.

English: The tallest man is my father.

Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw.

English: I would like to meet the tallest man.

Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru

Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentru engu.
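The recombination shown on this slide can be sketched as a greedy longest-match over known fragments. This is a simplified illustration, not the project's EBMT engine: the fragment table is assumed to have been extracted already from the two example pairs, and `translate_by_fragments` and `FRAGMENTS` are hypothetical names.

```python
def translate_by_fragments(sentence, fragment_table):
    """Cover the input left to right with the longest known fragments."""
    words = sentence.lower().rstrip(".").split()
    output, i = [], 0
    while i < len(words):
        for j in range(len(words), i, -1):       # try the longest span first
            span = " ".join(words[i:j])
            if span in fragment_table:
                output.append(fragment_table[span])
                i = j
                break
        else:
            output.append(words[i])              # no match: pass the word through
            i += 1
    return " ".join(output)


# Fragment alignments assumed to come from the two example pairs on the slide.
FRAGMENTS = {
    "i would like to meet": "Ayükefun trawüael",
    "the tallest man": "Chi doy fütra chi wentru",
}
```

Calling `translate_by_fragments("I would like to meet the tallest man", FRAGMENTS)` reproduces the "new" output on the slide, not the corrected one: plain concatenation cannot adjust the verb form or add the postposition, which is the gap the slide illustrates.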

Page 26:

Architecture Diagram

[Diagram labels: User; Learning Module; Elicitation Process; SVS Learning Process; Transfer Rules; Run-Time Module; SL Input; SL Parser; Transfer Engine; TL Generator; EBMT Engine; Unifier Module; TL Output]

Page 27:

Version Space Learning

• Symbolic learning from + and – examples

• Invented by Mitchell, refined by Hirsh

• Builds generalization lattice implicitly

• Bounded by G and S sets

• Worst-case exponential complexity (in size of G and S)

• Slow convergence rate
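For readers unfamiliar with version spaces, here is a minimal sketch of the classic candidate-elimination idea over attribute tuples, where `?` is a wildcard. It is a textbook-style simplification, not the AVENUE learner: it assumes noise-free examples, keeps a single most-specific hypothesis S, and does not prune redundant members of G. The function names and the toy attributes are assumptions made for illustration.

```python
def covers(h, x):
    """True if hypothesis h matches x ('?' matches anything).

    The same test also answers "is h at least as general as hypothesis x?".
    """
    return all(a == "?" or a == v for a, v in zip(h, x))


def candidate_elimination(examples, domains):
    """Return the (S, G) boundaries after processing labeled examples."""
    n = len(domains)
    S = None                    # specific boundary; unset until first positive
    G = [("?",) * n]            # general boundary starts at "anything"
    for x, positive in examples:
        if positive:
            G = [g for g in G if covers(g, x)]
            # Minimally generalize S so that it covers x.
            S = tuple(x) if S is None else tuple(
                s if s == v else "?" for s, v in zip(S, x))
        else:
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)          # already excludes the negative
                    continue
                # Minimally specialize g: fix one wildcard to a value != x[i].
                for i in range(n):
                    if g[i] != "?":
                        continue
                    for val in domains[i]:
                        h = g[:i] + (val,) + g[i + 1:]
                        if (val != x[i] and (S is None or covers(h, S))
                                and h not in new_G):
                            new_G.append(h)
            G = new_G
    return S, G
```

When S and G meet at the same hypothesis, the concept has been identified; the size of G between those points is where the worst-case blow-up comes from.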

Page 28:

Example of Transfer Rule Lattice

Page 29:

Seeded Version Spaces

• Generate concept seed from first + example

– Generalization-level hypothesis (POS + feature agreement for T-rules in NICE)

• Generalization/specialization level bounds

– Up to k-levels generalization, and up to j-levels specialization.

• Implicit lattice explored seed-outwards
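One possible reading of the seed-outwards search is sketched below, using attribute tuples with a `?` wildcard in place of real transfer rules. Everything here is an assumption for illustration: the hypothesis encoding, the two operators (`generalizations`, `specializations`), and the choice to search upward and downward from the seed separately. The slides do not specify these details.

```python
def covers(h, x):
    """True if hypothesis h matches instance x ('?' matches anything)."""
    return all(a == "?" or a == v for a, v in zip(h, x))


def generalizations(h):
    """One level up: replace one concrete attribute with the wildcard."""
    return [h[:i] + ("?",) + h[i + 1:] for i, a in enumerate(h) if a != "?"]


def specializations(h, domains):
    """One level down: replace one wildcard with a concrete value."""
    return [h[:i] + (v,) + h[i + 1:]
            for i, a in enumerate(h) if a == "?" for v in domains[i]]


def explore(seed, step, depth):
    """Breadth-first search from the seed, at most `depth` levels of `step`."""
    seen, frontier = {seed}, [seed]
    for _ in range(depth):
        frontier = list(dict.fromkeys(
            n for h in frontier for n in step(h) if n not in seen))
        seen.update(frontier)
    return seen


def seeded_version_space(seed, examples, domains, k, j):
    """Hypotheses within k levels up or j levels down that fit all examples."""
    up = explore(seed, generalizations, k)
    down = explore(seed, lambda h: specializations(h, domains), j)
    return {h for h in up | down
            if all(covers(h, x) == label for x, label in examples)}
```

Because the search never moves more than k or j levels from the seed, the number of hypotheses visited is bounded by the operator counts raised to those constants, rather than by the full lattice.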

Page 30:

Complexity of SVS

• O(g^k) upward search, where g = # of generalization operators

• O(s^j) downward search, where s = # of specialization operators

• Since j and k are constants, the SVS runs in polynomial time of order max(j,k)

• Convergence rates bounded by F(j,k)

Page 31:

Next Steps in SVS

• Implementation of transfer-rule interpreter (partially complete)

• Implementation of SVS to learn transfer rules (underway)

• Elicitation corpus extension for evaluation (under way)

• Evaluation first on Mapudungun MT (next)

Page 32:

NICE Partners

Language | Country | Institutions

Mapudungun (in place) | Chile | Universidad de la Frontera, Institute for Indigenous Studies, Ministry of Education

Iñupiaq (advanced discussion) | US (Alaska) | Ilisagvik College, Barrow school district, Alaska Rural Systemic Initiative, Trans-Arctic and Antarctic Institute, Alaska Native Language Center

Siona (discussion) | Colombia | OAS-CICAD, Plante, Department of the Interior

Page 33:

Agreement Between LTI and Institute of Indigenous Studies (IEI), Universidad De La Frontera, Chile

• Contributions of IEI

– Native language knowledge and linguistic expertise in Mapudungun

– Experience in bicultural, bilingual education

– Data collection: recording, transcribing, translating

– Orthographic normalization of Mapudungun

Page 34:

Agreement between LTI and Institute of Indigenous Studies (IEI), Universidad de la Frontera, Chile

• Contributions of LTI

– Develop MT technology for indigenous languages

– Training for data collection and transcription

– Partial support for data collection effort pending funding from Chilean Ministry of Education

– International coordination, technical and project management

Page 35:

LTI/IEI Agreement

• Continue collaboration on data collection and machine translation technology.

• Pursue focused areas of mutual interest, such as bilingual education.

• Seek additional funding sources in Chile and the US.

Page 36:

The IEI Team

• Coordinator (leader of a bilingual and multicultural education project):

– Eliseo Canulef

• Distinguished native speaker:

– Rosendo Huisca

• Linguists (one native speaker, one near-native)

– Juan Hector Painequeo

– Hugo Carrasco

• Typists/Transcribers

• Recording assistants

• Translators

• Native speaker linguistic informants

Page 37:

MINEDUC/IEI Agreement Highlights:

Based on the LTI/IEI agreement, the Chilean Ministry of Education agreed to fund the data collection and processing team for the year 2001. This agreement will be renewed each year, as needed.

Page 38:

MINEDUC/IEI Agreement: Objectives

To evaluate the NICE/Mapudungun proposal for orthography and spelling

To collect an oral corpus that represents the four Mapudungun dialects spoken in Chile. The main domain is primary health, traditional and western.

Page 39:

MINEDUC/IEI Agreement: Deliverables

An oral corpus of 800 hours recorded, proportional to the demography of each currently spoken dialect

120 hours transcribed and translated from Mapudungun to Spanish

A refined proposal for writing Mapudungun

Page 40:

NICE/Mapudungun: Database

• Writing conventions (Grafemario)

• Glossary Mapudungun/Spanish

• Bilingual newspaper, 4 issues

• Ultimas Familias – memoirs

• Memorias de Pascual Coña

– Publishable product with new Spanish translation

• 35 hours transcribed speech

• 80 hours recorded speech

Page 41:

NICE/Mapudungun: Other Products

• Standardization of orthography: Linguists at UFRO have evaluated the competing orthographies for Mapudungun and written a report detailing their recommendations for a standardized orthography for NICE.

• Training for spoken language collection: In January 2001 native speakers of Mapudungun were trained in the recording and transcription of spoken data.

Page 42:

Underfunded Activities

• Data collection

– Colombia (unfunded)

– Chile (partially funded)

• Travel

– More contact between CMU and Chile (UFRO) and Colombia.

• Training

– Train Mapuche linguists in language technologies at CMU.

– Extend training to Colombia

• Refine MT system for Mapudungun and Siona

– Current funding covers research on the MT engine and data collection, but not detailed linguistic analysis

Page 43:

Outline

• History of MT--See Wired magazine May 2000 issue. Available on the web.

• How well does it work?

• Procedure for designing an LT project.

• Choose an application: What do you want to do?

• Identify the properties of your application.

• Methods: knowledge-based, statistical/corpus based, or hybrid.

• Methods: interlingua, transfer, direct

• Typical components of an MT system.

• Typical resources required for an MT system.

Page 44:

How well does it work? Example: SpanAm

• Possibly the best Spanish-English MT system.

• Around 20 years of development.

Page 45:

How well does it work? Example: Systran

• Try it on the Altavista web page.

• Many language pairs are available.

• Some language pairs might have taken up to a person-century of development.

• Can translate text on any topic.

• Results may be amusing.

Page 46:

How well does it work? Example: KANT

• Translates equipment manuals for Caterpillar.

• Input is controlled English: many ambiguities are eliminated. The input is checked carefully for compliance with the rules.

• Around 5 output languages.

• The output might be post-edited.

• The result has to be perfect to prevent accidents with the equipment.

Page 47:

How well does it work? Example: JANUS

• Translates spoken conversations about booking hotel rooms or flights.

• Six languages: English, French, German, Italian, Japanese, Korean (with partners in the C-STAR consortium).

• Input is spontaneous speech spoken into a microphone.

• Output is around 60% correct.

• Task Completion is higher than translation accuracy: users can always get their flights or rooms if they are willing to repeat 40% of their sentences.

Page 48:

How well does it work? Speech Recognition

• Jupiter weather information: 1-888-573-8255. You can say things like “what cities do you know about in Chile?” and “What will be the weather tomorrow in Santiago?”.

• Communicator flight reservations: 1-877-CMU-PLAN. You can say things like “I’m travelling to Pittsburgh.”

• Speechworks demo: 1-888-SAY-DEMO. You can say things like “Sell my shares of Microsoft.”

• These are all in English, and are toll-free only in the US, but they are speaker-independent and should work with reasonable foreign accents.

Page 49:

Different kinds of MT

• Different applications: for example, translation of spoken language or text.

• Different methods: for example, translation rules that are hand crafted by a linguist or rules that are learned automatically by a machine.

• The work of building an MT program will be very different depending on the application and the methods.

Page 50:

Procedure for planning an MT project

• Choose an application.

• Identify the properties of your application.

• List your resources.

• Choose one or more methods.

• Make adjustments if your resources are not adequate for the properties of your application.

Page 51:

Choose an application: What do you want to do?

• Exchange email or chat in Quechua and Spanish.

• Translate Spanish web pages about science into Quechua so that kids can read about science in their language.

• Scan the web: “Is there any information about such-and-such new fertilizer and water pollution?” Then if you find something that looks interesting, take it to a human translator.

• Answer government surveys about health and agriculture (spoken or written).

• Ask directions (“where is the library?”) (spoken).

• Read government publications in Quechua.

Page 52:

Identify the properties of your application.

• Do you need reliable, high quality translation?

• How many languages are involved? Two or more?

• Type of input.

• One topic (for example, weather reports) or any topic (for example, calling your friend on the phone to chat).

• Controlled or free input.

• How much time and money do you have?

• Do you anticipate having to add new topics or new languages?

Page 53:

Do you need high quality?

• Assimilation: Translate something into your language so that you can:

– understand it--may not require high quality.

– evaluate whether it is important or interesting and then send it off for a better translation--does not require high quality.

– use it for educational purposes--probably requires high quality.

Page 54:

Do you need high quality?

• Dissemination: Translate something into someone else’s language e.g., for publication.

• Usually should be high quality.

Page 55:

Do you need high quality?

• Two-Way: e.g., chat room or spoken conversation

• May not require high reliability on correctness if you have a native language paraphrase.

– Original input: I would like to reserve a double room.

– Paraphrase: Could you make a reservation for a double room.

Page 56:

Type of Input

• Formal text: newspaper, government reports, on-line encyclopedia.

– Difficulty: long sentences

• Formal speech: spoken news broadcast.

– Difficulty: speech recognition won’t be perfect.

• Conversational speech:

– Difficulty: speech recognition won’t be perfect

– Difficulty: disfluencies

– Difficulty: non-grammatical speech

• Informal text: email, chat

– Difficulty: non-grammatical speech

Page 57:

Methods: Knowledge-Based

• Knowledge-based MT: a linguist writes rules for translation:

– noun adjective --> adjective noun

• Requires a computational linguist who knows the source and target languages.

• Usually takes many years to get good coverage.

• Usually high quality.

Page 58:

Methods: statistical/corpus-based

• Statistical and corpus-based methods involve computer programs that automatically learn to translate.

• The program must be trained by showing it a lot of data.

• Requires huge amounts of data.

• The data may need to be annotated by hand.

• Does not require a human computational linguist who knows the source and target languages.

• Could be applied to a new language in a few days.

• At the current state-of-the-art, the quality is not very good.

Page 59:

Methods: Interlingua

• An interlingua is a machine-readable representation of the meaning of a sentence.

– I’d like a double room/Quisiera una habitación doble.

– request-action+reservation+hotel(room-type=double)

• Good for multi-lingual situations. Very easy to add a new language.

• Probably better for limited domains -- meaning is very hard to define.
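To make the interlingua idea concrete, here is a small sketch that parses the representation from the slide and generates both surface sentences from it. The parser, the template tables, and the per-language lexicon are hypothetical and cover only this one example; a real system would need analysis and generation grammars for each language.

```python
def parse_interlingua(s):
    """Split 'act+concept+...(key=value, ...)' into a tuple and a dict."""
    head, _, rest = s.partition("(")
    acts = tuple(head.strip().split("+"))
    args = {}
    for pair in rest.rstrip(")").split(","):
        if pair.strip():
            key, _, value = pair.partition("=")
            args[key.strip()] = value.strip()
    return acts, args


RESERVE_ROOM = ("request-action", "reservation", "hotel")

# Hypothetical per-language generation resources for this single speech act.
TEMPLATES = {
    "en": {RESERVE_ROOM: "I'd like a {} room."},
    "es": {RESERVE_ROOM: "Quisiera una habitación {}."},
}
LEXICON = {
    "en": {"double": "double"},
    "es": {"double": "doble"},
}


def generate(acts, args, lang):
    """Render the language-neutral frame in the requested language."""
    room_type = LEXICON[lang][args["room-type"]]
    return TEMPLATES[lang][acts].format(room_type)
```

Adding a language in this sketch means adding one entry to `TEMPLATES` and one to `LEXICON`, without touching any other language, which is the sense in which an interlingua makes new languages easy to add.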

Page 60:

Multilingual Interlingual Machine Translation

Page 61:

Methods: Transfer

• A transfer rule tells you how a structure in one language corresponds to a different structure in another language:

– an adjective followed by a noun in English corresponds to a noun followed by an adjective in Spanish.

• Not good when there are more than two languages -- you have to write different transfer rules for each pair.

• Better than interlingua for unlimited domain.
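The adjective-noun rule above can be sketched as a reordering over part-of-speech-tagged tokens, together with a count showing why transfer scales poorly with the number of languages. The tag names and function names are assumptions for illustration; the sketch only reorders and does not translate the words themselves.

```python
def apply_adj_noun_rule(tagged):
    """Reorder English ADJ NOUN into Spanish-style NOUN ADJ.

    `tagged` is a list of (word, tag) pairs; other sequences pass through.
    """
    out, i = [], 0
    while i < len(tagged):
        if (i + 1 < len(tagged)
                and tagged[i][1] == "ADJ" and tagged[i + 1][1] == "NOUN"):
            out.extend([tagged[i + 1], tagged[i]])
            i += 2
        else:
            out.append(tagged[i])
            i += 1
    return out


def transfer_rule_sets(n):
    """Directed language pairs needing their own transfer rules."""
    return n * (n - 1)


def interlingua_modules(n):
    """One analyzer plus one generator per language."""
    return 2 * n
```

For six languages, as in the JANUS example, `transfer_rule_sets(6)` gives 30 directed pairs, while `interlingua_modules(6)` gives 12 modules.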

Page 62:

Methods: Direct

• Direct translation does not involve analyzing the structure or meaning of a language.

• For example, look up each word in a bilingual dictionary.

• Results can be hilarious: “the spirit is willing but the flesh is weak” can become “the wine is good, but the meat is lousy.”

• Can be developed very quickly.

• Can be a good back-up when more complicated methods fail to produce output.
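A word-for-word dictionary lookup takes only a few lines, which is why it is quick to build and usable as a fallback. The sketch below uses a hypothetical toy Spanish-English dictionary; `direct_translate` and `TOY_ES_EN` are illustrative names.

```python
def direct_translate(sentence, dictionary):
    """Word-for-word lookup; unknown words are passed through unchanged."""
    return " ".join(dictionary.get(w, w) for w in sentence.lower().split())


# Toy bilingual dictionary (assumed entries, one sense per word).
TOY_ES_EN = {"el": "the", "puente": "bridge", "agua": "water", "fria": "cold"}
```

`direct_translate("el puente Agua Fria", TOY_ES_EN)` gives "the bridge water cold": every word is covered, but word order is not repaired and the place name is translated literally, which is the kind of output the slide warns about.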

Page 63:

Components of a Knowledge-Based Interlingua MT System

• Morphological analyzer: identify prefixes, suffixes, and stem.

• Parser (sentence-to-syntactic structure for source language, hand-written or automatically learned)

• Meaning interpreter (syntax-to-semantics, source language).

• Meaning interpreter (semantics-to-syntax, target language).

• Generator (syntactic structure-to-sentence) for target language.

Page 64:

Resources for a knowledge-based interlingua MT system

• Computational linguists who know the source and target languages.

• As large a corpus as possible so that the linguists can confirm that they are covering the necessary constructions, but the size of the corpus is not crucial to system development.

• Lexicons for source and target languages, syntax, semantics, and morphology.

• A list of all the concepts that can be expressed in the system’s domain.

Page 65:

Components of Example Based MT: a direct statistical method

• A morphological analyzer and part of speech tagger would be nice, but not crucial.

• An alignment algorithm that runs over a parallel corpus and finds corresponding source and target sentences.

• An algorithm that compares an input sentence to sentences that have been previously translated, or whose translation is known.

• An algorithm that pulls out the corresponding translation, possibly slightly modifying a previous translation.
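The matching component, which compares an input sentence to previously translated sentences, might look like the following sketch. It uses word-level similarity from Python's standard `difflib` module as a stand-in for whatever matching the actual EBMT engine uses; `closest_example` is a hypothetical name, and the example pairs are the ones from the earlier EBMT slide.

```python
from difflib import SequenceMatcher


def closest_example(sentence, examples):
    """Return the (source, target) pair whose source best matches the input.

    Similarity is the SequenceMatcher ratio over lowercased word lists.
    """
    words = sentence.lower().split()

    def similarity(pair):
        return SequenceMatcher(None, words, pair[0].lower().split()).ratio()

    return max(examples, key=similarity)


EXAMPLES = [
    ("I would like to meet her", "Ayükefun trawüael fey engu"),
    ("The tallest man is my father",
     "Chi doy fütra chi wentru fey ta inche ñi chaw"),
]
```

The retrieved target sentence is only a starting point; the last component in the list above would still have to modify it where the input differs from the stored example.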

Page 66:

Resources for Example Based MT

• Lexicons would improve quality of translation, but are not crucial.

• A large parallel corpus (hundreds of thousands of words).

Page 67:

“Omnivorous” Multi-Engine MT: eats any available resources

Page 68:

Approaches we had in mind

• Direct bilingual-dictionary lookup: because it is easy and is a back-up when other methods fail.

• Generalized Example-Based MT: because it is easy and fast and can also be a back-up.

• Instructable Transfer-based MT: a new, untested idea involving machine learning of rules from a human native speaker. Useful when computational linguists don’t know the language, and people who know the language are not computational linguists.

• Conventional, hand-written transfer rules: in case the new method doesn’t work.