Semantics in Statistical Machine Translation
Jan Odijk
MA-Rotation Lecture
Utrecht, March 10, 2011
Overview
• Machine Translation (MT)
• Rule-based MT
• Statistical MT
• Hybrid MT
MT: What is it?
• Input: text in source language
• Output: text in target language that is a translation of the input text
MT: What is it?
[Figure: translation pyramid. Top: Interlingua. Middle: Analyzed input, transfer, Analyzed output. Bottom: Input, direct translation, Output.]
MT: System Types
• Direct:
  – Earliest systems (1950s)
    • Direct word-to-word translation
  – Recent statistical MT systems
• Transfer
  – Almost all research and commercial systems up to 1990
• Interlingual
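The direct word-to-word approach listed above can be sketched in a few lines. This is only an illustration: the word list and the function name `translate_word_for_word` are invented here and do not come from any of the systems mentioned.

```python
# Hypothetical Dutch-to-English word list, invented for illustration.
LEXICON = {
    "hij": "he", "draagt": "wears", "heeft": "has",
    "een": "a", "pak": "suit", "aan": "on",
}

def translate_word_for_word(sentence):
    """Replace each source word by its single listed equivalent."""
    # Unknown words are passed through unchanged.
    return " ".join(LEXICON.get(w, w) for w in sentence.lower().split())

print(translate_word_for_word("Hij draagt een pak"))     # he wears a suit
print(translate_word_for_word("Hij heeft een pak aan"))  # he has a suit on
```

The second sentence shows the limit of the approach: the particle verb "aanhebben" (to wear) is split across the sentence, so a word-by-word lookup cannot recover its meaning.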
MT: System Types
• Interlingual
  – A few research systems in the 1980s
    • Rosetta (Philips), based on Montague Grammar
      – Semantic derivation trees of attuned grammars
    • Distributed Language Translation (DLT, BSO)
      – (enriched) Esperanto
    • Sometimes logical representations
• Hybrid Interlingual/Transfer
  – Transfer for lexicons; IL for rules
Rule-Based Systems
• Most systems
  – explicit source language grammar
  – parser yields analysis of source language input
  – transfer component turns it into target language structure
  – no explicit grammar of target language (except morphology)
Rule-Based Systems
• Some systems (Eurotra)
  – explicit source and target language grammar
    • sometimes reversible
  – parser yields analysis of source language input
  – transfer component turns it into target language structure
  – generation of translation by target language grammar
Rule-Based Systems
• Some systems (Rosetta, DLT)
  – explicit source and target language grammar
    • in some cases reversible
  – parser yields interlingual representation
  – generation of translation by target language grammar from interlingual representation
MT: Is it difficult?
• FAHQT: Fully Automatic High Quality Translation
  – Fully Automatic: no human intervention
  – High Quality: close or equal to human translation
• Even acceptable quality is difficult to achieve
MT: Problems
• Ambiguity
  – Real
    • Cannot be resolved by grammar
    • Is much higher than a human can imagine!
    • Requires world knowledge modeling or statistics
  – Temporary
    • Is resolved by the grammar but requires large computational resources
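To give a feel for how quickly structural ambiguity can grow, a standard illustration (not taken from the lecture) is the number of distinct binary bracketings of a sequence of n + 1 constituents, which is the n-th Catalan number. The function name `catalan` is chosen here for the sketch.

```python
from math import comb

def catalan(n: int) -> int:
    """Number of distinct binary bracketings of n + 1 items."""
    return comb(2 * n, n) // (n + 1)

# Growth of the number of bracketings with sequence length.
for n in (2, 5, 10, 20):
    print(n, catalan(n))
```

Real grammars rule out many of these bracketings, so this is an upper-bound style intuition rather than a count for any particular system.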
MT: Problems
• Computational Complexity
  – Most rule-based systems have a context-free base (O(n^3)) plus extensions (O(?))
  – Require large computational resources
  – Require large memory resources
  – Sentences with length > 20 hardly processable
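The cubic cost of the context-free base can be seen in a minimal CKY recognizer, sketched here under the assumption that the grammar is in Chomsky normal form. The toy grammar, the lexicon and the name `cky_recognize` are invented for illustration and are not from any of the systems discussed.

```python
from itertools import product

# Hypothetical toy grammar in Chomsky normal form (not from the lecture).
BINARY = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
LEXICON = {
    "hij": {"NP"},
    "draagt": {"V"},
    "een": {"Det"},
    "pak": {"N"},
}

def cky_recognize(words, start="S"):
    """Return True if the toy grammar derives the word sequence."""
    n = len(words)
    # chart[i][j] holds the categories that span words[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(LEXICON.get(w, ()))
    for span in range(2, n + 1):          # O(n) span lengths
        for i in range(n - span + 1):     # O(n) start positions
            j = i + span
            for k in range(i + 1, j):     # O(n) split points
                for left, right in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= BINARY.get((left, right), set())
    return start in chart[0][n]

print(cky_recognize("hij draagt een pak".split()))
print(cky_recognize("pak een draagt hij".split()))
```

The three nested loops over span length, start position and split point are where the O(n^3) behaviour comes from; the extensions mentioned on the slide would add cost on top of this.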
MT: Problems
• Complexity of language
  – Many different construction types
  – All interacting with each other
  – Full coverage is hard to achieve; systems often fall back on robustness measures
  – For many constructions proper analysis is not known
  – Theoretical linguistics is not going to help because of its focus on explanatory adequacy
MT: Problems
• Divergences between languages
  – Lexical categorial:
    • zich ergeren vs. (be) annoyed (Verb-Adj)
    • hij zwemt graag vs. he likes to swim
  – Phrasal categorial:
    • I expect her to leave
      – ik verwacht dat zij vertrekt
    • She is likely to come
      – het is waarschijnlijk dat zij komt
Conflational Divergences:
• prepositional complements
  – houden van vs. love
• existential er vs. Ø
  – er passeerde een auto vs. a car passed
• verbal particles
  – blow (something) up vs. volar
Conflational Divergences:
• reflexive verbs
  – zich scheren vs. shave
• composed vs. simple tense forms
  – he will do it vs. lo hará
• split negatives vs. composed negatives
  – he does not see anyone vs. hij ziet niemand
Functional Divergences:
• I like these apples
  – me gustan estas manzanas
• se venden manzanas aquí
  – hier verkoopt men appels
• er werd door de toeschouwers gejuicht
  – the spectators were cheering
Divergences: MWEs
• semi-fixed MWEs
  – nuclear power plant vs. kerncentrale
• flexible idioms
  – de plaat poetsen vs. bolt
  – de pijp uit gaan vs. to kick the bucket
Divergences: MWEs
• semi-idioms (collocations)
  – zware shag vs. strong tobacco
• semi-idioms (support verbs)
  – aandacht besteden aan vs. pay attention to
MT: Why is it so difficult?
• Language Competence vs. Language Use
  – Earlier research systems implemented idealized reality
  – But not the really occurring language use
  – In some cases
    • focus on theoretically interesting difficult constructions (that do occur in reality)
    • But other constructions are more important to deal with in practical systems
MT: Why is it so difficult?
• Large and rich lexicons
  – Existing human-oriented dictionaries are not suited as such
  – All information must be available in a formalized way
  – Much more information is needed than in a traditional dictionary
MT: Why is it so difficult?
• Multi-word Expressions (MWEs)
  – Are in current dictionaries only in a very informal way
  – No standards on how to represent them lexically
  – Many different types requiring different treatment in the grammar
  – Huge numbers!!
  – Domain and company-specific terminology are often MWEs
MT: Why is it so difficult?
• All systems must make approximations:
  – Ignore certain ambiguities to begin with
  – Use only limited amount of relevant information
  – Cut off analysis when there are too many alternatives
Statistical MT
• Statistical MT
• Derives MT-system automatically
  – From statistics taken from
    • Aligned parallel corpora (for the translation model)
    • Monolingual target language corpora (for the language model)
• Being worked on since the early 1990s
• Paradigm originates in speech recognition (which in turn builds on noisy channel models)
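In the usual noisy-channel formulation, the system picks the target sentence e that maximizes P(e | f) for a source sentence f, which by Bayes' rule is the e that maximizes P(f | e) · P(e): the translation model times the language model. The sketch below shows how the two models combine on a single ambiguous word. All probabilities are invented for illustration, and `best_translation` is a hypothetical helper, not part of any real system.

```python
# Hypothetical toy probabilities, invented purely for illustration.
# Translation model: P(source word | target word).
TRANSLATION_MODEL = {
    ("pak", "suit"): 0.4,
    ("pak", "package"): 0.6,
}
# Language model: P(target sentence), here just a lookup table.
LANGUAGE_MODEL = {
    "he wears a suit": 0.01,
    "he wears a package": 0.0001,
}

def best_translation(source_word, template, candidates):
    """Pick the candidate maximizing P(f | e) * P(e)."""
    def score(target_word):
        sentence = template.format(target_word)
        return (TRANSLATION_MODEL.get((source_word, target_word), 0.0)
                * LANGUAGE_MODEL.get(sentence, 0.0))
    return max(candidates, key=score)

print(best_translation("pak", "he wears a {}", ["suit", "package"]))
```

In this toy setting the translation model alone slightly prefers "package", but the language model makes "he wears a suit" the better overall hypothesis.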
MT: Can we make it possible?
• Plus:
  – No or very limited grammar development
  – Includes language and world knowledge automatically (but implicitly)
  – Based on actually occurring data
  – Currently many experimental and commercial systems
• Minus:
  – Requires large aligned parallel corpora
  – Clearly has problems with longer span dependencies
Statistical MT
• Google Translate (statistical MT)
• Hij draagt een pak. √ He wears a suit.
• Hij draagt schoenen. √ He wears shoes.
• Hij draagt bruine schoenen en een pak.
  – √ He wears a suit and brown shoes. (!!)
• Hij draagt het pakket. √ He carries the package.
• Hij heeft een pak aan. * He has a suit.
• Voert uw bedrijf sloten uit?
  – * Does your company locks out?
Hybrid MT:
• Can we somehow combine the strengths of rule-based approaches and the statistical approaches
  – And avoid their disadvantages?
• Active research area
  – Several projects
Hybrid MT
• Euromatrix, esp. “the Euromatrix”
  – Lists data and tools for European language pairs
  – Goals
    • Translation systems for all pairs of EU languages
    • Organization, analysis and interpretation of a competitive annual international evaluation of machine translation
    • The provision of open source machine translation technology including research tools, software and data
    • A systematically compiled and constantly updated detailed survey of the state of MT technology for all EU language pairs
    • Efficient inclusion of linguistic knowledge into statistical machine translation
    • The development and testing of hybrid architectures for the integration of rule-based and statistical approaches
• Successor project EuromatrixPlus
Hybrid MT
• PACO-MT 2008-2011
• Investigates hybrid approach to MT
  – Rule-based and statistical
  – Uses existing parser for source language analysis
  – Uses statistical n-gram language models for generation
  – Uses statistical approach to transfer
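To illustrate what an n-gram language model contributes to generation, here is a minimal bigram model with add-one smoothing that ranks candidate word orders. It is a sketch under simple assumptions, not PACO-MT's implementation: the miniature corpus and the names `train_bigram` and `log_prob` are invented for this example.

```python
import math
from collections import Counter

# Hypothetical miniature target-language corpus (not PACO-MT data).
CORPUS = [
    "he wears a suit",
    "he wears brown shoes",
    "he carries the package",
]

def train_bigram(sentences):
    """Collect history and bigram counts plus the vocabulary size."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(tokens[:-1])            # counts of history words
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = {t for s in sentences for t in s.split()} | {"</s>"}
    return unigrams, bigrams, len(vocab)

def log_prob(sentence, model):
    """Log probability of a sentence under the smoothed bigram model."""
    unigrams, bigrams, v = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    # Add-one (Laplace) smoothing so unseen bigrams get nonzero probability.
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + v))
        for a, b in zip(tokens, tokens[1:])
    )

model = train_bigram(CORPUS)
candidates = ["he wears a suit", "he a suit wears", "suit a wears he"]
print(max(candidates, key=lambda c: log_prob(c, model)))
```

The model prefers the word order it has seen in the corpus, which is the sense in which a language model can steer generation towards fluent output.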
Hybrid MT
• META-NET 2010-2013 (EU funding)
  – Building a community with shared vision and strategic research agenda
  – Building META-SHARE, an open resource exchange facility
  – Building bridges to neighbouring technology fields
    • Bringing more Semantics into Translation
    • Optimising the Division of Labour in Hybrid MT
    • Exploiting the Context for Translation
    • Empirical Base for Machine Translation
Hybrid MT
• Bringing more Semantics into Translation
  – Charles University Prague (Jan Hajic)
  – FBK-Irst, Trento (Marcello Federico)
  – UiL-OTS, Utrecht (Christer Samuelsson)
    • currently orienting ourselves and trying to determine a concrete topic for investigation
Hybrid MT: Semantics
• Possible Topics:
  – lexical semantics and their resources / Word Sense Disambiguation
  – knowledge representations
  – multiword expressions
  – Syntactic and semantic dependencies / Semantic Role Labeling
  – Discourse structure
  – Co-reference resolution
  – Recognizing Textual Entailment and MT Evaluation
Semantics resources
• Lexical Semantics
  – Resources: WordNet, EuroWordNet, BalkaNet, WordNets for several languages
  – Knowledge Repositories:
    • OpenCyc, Wikipedia, DBpedia
• MWE Lexica: SAID, DUELME
Semantics Resources
• CoNLL 2009 Shared Task on syntactic and semantic dependencies
  – training and development data
  – evaluation data
• Penn Discourse TreeBank
Hybrid MT
• Tools:
• SRL and Semantic Parsing: SWIRL, ASSERT, SENNA, C&C (all for Eng), tools developed at Lund University (for Eng and Chn)
Semantics Resources
• Tools:
• Co-Reference and Anaphora Resolution:
  – BART (Eng)
  – COREA (Dut)
• NER:
  – BIOS (Eng)