
Semantics in Statistical Machine Translation

Jan Odijk

MA-Rotation Lecture

Utrecht, March 10, 2011

Overview

• Machine Translation (MT)
• Rule-based MT
• Statistical MT
• Hybrid MT

MT: What is it?

• Input: text in source language
• Output: text in target language that is a translation of the input text

MT: What is it?

[Figure: MT triangle. Base: Input → direct translation → Output. Middle: Analyzed input → transfer → Analyzed output. Apex: Interlingua.]

MT: System Types

• Direct:
  – Earliest systems (1950s)
    • Direct word-to-word translation
  – Recent statistical MT systems
• Transfer
  – Almost all research and commercial systems up to 1990
• Interlingual

MT: System Types

• Interlingual
  – A few research systems in the 1980s
    • Rosetta (Philips), based on Montague Grammar
      – Semantic derivation trees of attuned grammars
    • Distributed Language Translation (DLT, BSO)
      – (enriched) Esperanto
    • Sometimes logical representations
• Hybrid Interlingual/Transfer
  – Transfer for lexicons; IL for rules

Rule-Based Systems

• Most systems
  – explicit source language grammar
  – parser yields analysis of source language input
  – transfer component turns it into target language structure
  – no explicit grammar of target language (except morphology)

Rule-Based Systems

• Some systems (Eurotra)
  – explicit source and target language grammar
    • sometimes reversible
  – parser yields analysis of source language input
  – transfer component turns it into target language structure
  – generation of translation by target language grammar

Rule-Based Systems

• Some systems (Rosetta, DLT)
  – explicit source and target language grammar
    • in some cases reversible
  – parser yields interlingual representation
  – generation of translation by target language grammar from interlingual representation

MT: Is it difficult?

• FAHQT: Fully Automatic High Quality Translation
  – Fully Automatic: no human intervention
  – High Quality: close or equal to human translation
• Even acceptable quality is difficult to achieve

MT: Problems

• Ambiguity
  – Real
    • Cannot be resolved by grammar
    • Is much higher than a human can imagine!
    • Requires world knowledge modeling or statistics
  – Temporary
    • Is resolved by the grammar but requires large computational resources

MT: Problems

• Computational Complexity
  – Most rule-based systems have a context-free base (O(n³)) plus extensions (O(?))
  – Require large computational resources
  – Require large memory resources
  – Sentences with length > 20 hardly processable
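The O(n³) bound for a context-free base comes from chart parsing: for each span length, each start position, and each split point, the parser combines smaller analyses. The following is a minimal CYK recognizer sketch; the toy grammar and lexicon are hypothetical and not taken from any system discussed in the lecture.

```python
from itertools import product

# Hypothetical toy grammar in Chomsky normal form (illustration only).
BINARY = {("NP", "VP"): {"S"}, ("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}}
LEXICON = {"he": {"NP"}, "wears": {"V"}, "a": {"Det"}, "suit": {"N"}}


def cyk_recognize(words, start="S"):
    """Return True if the toy grammar derives `words` from `start`."""
    n = len(words)
    # chart[i][j] holds the categories that can span words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(LEXICON.get(word, ()))
    # Three nested loops over the sentence length give the O(n^3) behaviour.
    for length in range(2, n + 1):            # span length
        for i in range(n - length + 1):       # span start
            j = i + length
            for k in range(i + 1, j):         # split point
                for left, right in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= BINARY.get((left, right), set())
    return start in chart[0][n]


print(cyk_recognize("he wears a suit".split()))
```

The grammar size is a constant factor here; the extensions mentioned on the slide (features, transformations and the like) fall outside this sketch and are what make the real cost hard to bound.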

MT: Problems

• Complexity of language
  – Many different construction types
  – All interacting with each other
  – Full coverage is hard to achieve; systems often fall back on robustness measures
  – For many constructions proper analysis is not known
  – Theoretical linguistics is not going to help because of focus on explanatory adequacy

MT: Problems

• Divergences between languages
  – Lexical categorial:
    • zich ergeren vs. (be) annoyed (Verb-Adj)
    • hij zwemt graag vs. he likes to swim
  – Phrasal categorial
    • I expect her to leave
      – ik verwacht dat zij vertrekt
    • She is likely to come
      – het is waarschijnlijk dat zij komt

Conflational Divergences:

• prepositional complements
  – houden van vs. love
• existential er vs. Ø
  – er passeerde een auto vs. a car passed
• verbal particles
  – blow (something) up vs. volar

Conflational Divergences:

• reflexive verbs
  – zich scheren vs. shave
• composed vs. simple tense forms
  – he will do it vs. lo hará
• split negatives vs. composed negatives
  – he does not see anyone vs. hij ziet niemand

Functional Divergences:

• I like these apples
  – me gustan estas manzanas
• se venden manzanas aquí
  – hier verkoopt men appels
• er werd door de toeschouwers gejuicht
  – the spectators were cheering

Divergences: MWEs

• semi-fixed MWEs
  – nuclear power plant vs. kerncentrale
• flexible idioms
  – de plaat poetsen vs. bolt
  – de pijp uit gaan vs. to kick the bucket

Divergences: MWEs

• semi-idioms (collocations)
  – zware shag vs. strong tobacco
• semi-idioms (support verbs)
  – aandacht besteden aan vs. pay attention to
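One simple way to handle the semi-fixed cases above is a longest-match lookup in an MWE lexicon before word-by-word translation. The sketch below assumes a hypothetical lexicon keyed by token tuples (entries taken from the slide examples); it only covers contiguous expressions, so flexible idioms such as "de pijp uit gaan", whose parts can be separated or inflected, would need grammar-level treatment instead.

```python
# Hypothetical MWE lexicon: English token tuples mapped to Dutch equivalents.
MWE_LEXICON = {
    ("nuclear", "power", "plant"): "kerncentrale",
    ("strong", "tobacco"): "zware shag",
}
MAX_LEN = max(len(key) for key in MWE_LEXICON)


def tag_mwes(tokens):
    """Greedy left-to-right longest match; returns (unit, translation) pairs.

    Single words that are not part of a known MWE get translation None.
    """
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-word expressions.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            key = tuple(tokens[i:i + n])
            if key in MWE_LEXICON:
                out.append((" ".join(key), MWE_LEXICON[key]))
                i += n
                break
        else:
            out.append((tokens[i], None))
            i += 1
    return out


print(tag_mwes("the nuclear power plant closed".split()))
```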

MT: Why is it so difficult?

• Language Competence vs. Language Use
  – Earlier research systems implemented idealized reality
  – But not the really occurring language use
  – In some cases
    • focus on theoretically interesting difficult constructions (that do occur in reality)
    • But other constructions are more important to deal with in practical systems


• Large and rich lexicons
  – Existing human-oriented dictionaries are not suited as such
  – All information must be available in a formalized way
  – Much more information is needed than in a traditional dictionary


• Multi-word Expressions (MWEs)
  – Are in current dictionaries only in a very informal way
  – No standards on how to represent them lexically
  – Many different types requiring different treatment in the grammar
  – Huge numbers!!
  – Domain and company-specific terminology are often MWEs


• All systems must make approximations:
  – Ignore certain ambiguities to begin with
  – Use only limited amount of relevant information
  – Cut off analysis when there are too many alternatives
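The last approximation, cutting off analysis when there are too many alternatives, is commonly realized as beam pruning: after each step only the best-scoring partial analyses are kept. A minimal sketch, assuming hypotheses are (score, analysis) pairs with higher scores being better; the function name and the scores are illustrative only.

```python
import heapq


def prune(hypotheses, beam_size=3):
    """Keep only the `beam_size` highest-scoring (score, analysis) pairs."""
    return heapq.nlargest(beam_size, hypotheses, key=lambda h: h[0])


# Hypothetical scored alternatives for one sentence.
alternatives = [(0.10, "reading A"), (0.50, "reading B"),
                (0.30, "reading C"), (0.05, "reading D")]
print(prune(alternatives, beam_size=2))
```

The trade-off is the one the slide implies: pruning bounds time and memory, but the correct analysis is lost whenever it scores poorly early on.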

Statistical MT

• Statistical MT
• Derives MT system automatically
  – From statistics taken from
    • Aligned parallel corpora (translation model)
    • Monolingual target language corpora (language model)
• Being worked on since the early 1990s
• Paradigm originates in speech recognition (and that in turn in noisy channel models)
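In the noisy channel formulation, the system picks the target sentence e that maximizes P(f | e) · P(e) for a source sentence f, where P(f | e) is the translation model and P(e) the language model. The sketch below illustrates this with invented probabilities and a word-by-word, monotone search; both are simplifying assumptions (real systems learn the tables from corpora and also model alignment and reordering).

```python
import math
from itertools import product

# Hypothetical translation model: TM[source][target] stands for P(source | target).
TM = {
    "hij": {"he": 1.0},
    "draagt": {"wears": 0.5, "carries": 0.5},
    "schoenen": {"shoes": 1.0},
}
# Hypothetical bigram language model P(word | previous word).
LM = {
    ("<s>", "he"): 0.5,
    ("he", "wears"): 0.2,
    ("he", "carries"): 0.2,
    ("wears", "shoes"): 0.3,
    ("carries", "shoes"): 0.05,
}
UNSEEN = 1e-6  # floor for bigrams missing from the toy table


def decode(source_words):
    """Return the target word list maximizing log P(f|e) + log P(e)."""
    options = [TM[word].items() for word in source_words]
    best, best_score = None, -math.inf
    for candidate in product(*options):
        target = [t for t, _ in candidate]
        score = sum(math.log(p) for _, p in candidate)   # translation model
        prev = "<s>"
        for t in target:                                  # language model
            score += math.log(LM.get((prev, t), UNSEEN))
            prev = t
        if score > best_score:
            best, best_score = target, score
    return best


print(decode(["hij", "draagt", "schoenen"]))
```

Here the translation model is indifferent between "wears" and "carries" for "draagt"; the language model decides, because "wears shoes" is the likelier English bigram in the toy table.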

MT: Can we make it possible?

• Plus:
  – No or very limited grammar development
  – Includes language and world knowledge automatically (but implicitly)
  – Based on actually occurring data
  – Currently many experimental and commercial systems
• Minus:
  – Requires large aligned parallel corpora
  – Clearly has problems with longer span dependencies

Statistical MT

• Google Translate (statistical MT)
• Hij draagt een pak. √ He wears a suit.
• Hij draagt schoenen. √ He wears shoes.
• Hij draagt bruine schoenen en een pak.
  • √ He wears a suit and brown shoes. (!!)
• Hij draagt het pakket. √ He carries the package.
• Hij heeft een pak aan. * He has a suit.
• Voert uw bedrijf sloten uit?
  – * Does your company locks out?

http://translate.google.com/#nl%7Cen%7C

Hybrid MT:

• Can we somehow combine the strengths of rule-based approaches and the statistical approaches
  – And avoid their disadvantages?
• Active research area
  – Several projects

Hybrid MT

• Euromatrix esp. “the Euromatrix”
  – Lists data and tools for European language pairs
  – Goals
    • Translation systems for all pairs of EU languages
    • Organization, analysis and interpretation of a competitive annual international evaluation of machine translation
    • The provision of open source machine translation technology including research tools, software and data
    • A systematically compiled and constantly updated detailed survey of the state of MT technology for all EU language pairs
    • Efficient inclusion of linguistic knowledge into statistical machine translation
    • The development and testing of hybrid architectures for the integration of rule-based and statistical approaches
• Successor project EuromatrixPlus

http://www.euromatrix.net/

http://www.euromatrix.net/euromatrix


http://www.euromatrixplus.net/

Hybrid MT

• PACO-MT 2008-2011
• Investigates hybrid approach to MT
  – Rule-based and statistical
  – Uses existing parser for source language analysis
  – Uses statistical n-gram language models for generation
  – Uses statistical approach to transfer

http://www.ccl.kuleuven.be/Projects/PACO/paco.php
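As a rough illustration of how an n-gram language model can rank candidate outputs during generation, the sketch below estimates a bigram model with add-one smoothing from a tiny corpus. This is a generic textbook construction, not PACO-MT's implementation; the corpus sentences and function names are invented for the example.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams over whitespace-tokenized sentences."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)


def log_prob(sentence, model):
    """Add-one smoothed log probability of a sentence under the bigram model."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


# Hypothetical monolingual target-language corpus.
model = train_bigram(["he wears a suit", "he wears shoes", "he carries the package"])
candidates = ["he wears a suit", "suit a wears he"]
print(max(candidates, key=lambda c: log_prob(c, model)))
```

The fluent word order wins because all of its bigrams were observed, while the scrambled order relies on smoothed, unseen bigrams only.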

Hybrid MT

• META-NET 2010-2013 (EU-funding)
  – Building a community with shared vision and strategic research agenda
  – Building META-SHARE, an open resource exchange facility
  – Building bridges to neighbouring technology fields
    • Bringing more Semantics into Translation
    • Optimising the Division of Labour in Hybrid MT
    • Exploiting the Context for Translation
    • Empirical Base for Machine Translation

http://www.meta-net.eu/

Hybrid MT

• Bringing more Semantics into Translation
  – Charles University Prague (Jan Hajic)
  – FBK-Irst, Trento (Marcello Federico)
  – UiL-OTS, Utrecht (Christer Samuelsson)
    • currently orienting ourselves and trying to determine a concrete topic for investigation

Hybrid MT: Semantics

• Possible Topics:
  – lexical semantics and their resources / Word Sense Disambiguation
  – knowledge representations
  – multiword expressions
  – Syntactic and semantic dependencies / Semantic Role Labeling
  – Discourse structure
  – Co-reference resolution
  – Recognizing Textual Entailment and MT Evaluation

Semantics resources

• Lexical Semantics
  – Resources: WordNet, EuroWordNet, BalkaNet, WordNets for several languages
  – Knowledge Repositories:
    • OpenCyc, Wikipedia, DBpedia
• MWE Lexica: SAID, DUELME

http://wordnet.princeton.edu/

http://www.illc.uva.nl/EuroWordNet/

http://www.ist-world.org/ProjectDetails.aspx?ProjectId=78860940eb8e439583483b88f125164b

http://www.globalwordnet.org/gwa/wordnet_table.htm


http://www.opencyc.org/

http://www.wikipedia.org/

http://dbpedia.org/About

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T10

http://duelme.inl.nl/

Semantics Resources

• CoNLL 2009 Shared Task on syntactic and semantic dependencies
  – training and development data
  – evaluation data
• Penn Discourse TreeBank

http://ufal.mff.cuni.cz/conll2009-st/train-dev-data.html

http://ufal.mff.cuni.cz/conll2009-st/eval-data.html

http://www.seas.upenn.edu/~pdtb/



Hybrid MT

• Tools:
• SRL and Semantic Parsing: SWIRL, ASSERT, SENNA, C&C (all for Eng), tools developed at Lund University (for Eng and Chn)

http://www.surdeanu.name/mihai/swirl

http://cemantix.org/assert.html

http://ml.nec-labs.com/senna/

http://svn.ask.it.usyd.edu.au/trac/candc/wiki/boxer

http://nlp.cs.lth.se/software/

Semantics Resources

• Tools:
• Co-Reference and Anaphora Resolution:
  – BART (Eng)
  – COREA (Dut)
• NER:
  – BIOS (Eng)

http://www.bart-coref.org/

http://www.cnts.ua.ac.be/~hoste/corea.html

http://www.surdeanu.name/mihai/bios/