Simultaneous Multilingual Search for Translingual Information Retrieval
Kristen Parton (1), Kathleen McKeown (1), James Allan (2), Enrique Henestroza (1)
Motivation: Cross-Lingual IR
  Query in User Language → Documents → Search Results in Document Language(s)
  User needs to search documents in other languages
  Query: stereotypes of Arabs
  Search result: الملكة رانيا العبد الله تناقش الصورة النمطية عن العرب
Task Redefinition: Translingual IR
  Query in User Language → Documents → Search Results in User Language
  User needs to search documents in other languages and get back translated results
  Query: stereotypes of Arabs
  Search result: Queen Rania Al-Abdullah discusses stereotypes of Arabs
For translingual applications, integrating CLIR and result translation can improve both relevance and translation quality
Outline
1. Approaches to CLIR
2. SMLIR for Translingual IR
3. Query-Directed MT Post-Editing
4. System Evaluation
5. Conclusions and Future Work
Approaches to CLIR
  Map query and/or documents to common representation
  Query: Schwarzenegger
  Doc1: يذكر ان شوارزنجر هو ايضا نصير للحركة الأوليمبية الخاصة ...
  Doc2: ... الى جانب النجم وحاكم ولاية كاليفورنيا ارنولد شوارزنيجر .
  Doc3: فشل كل الاقتراحات التي عرضها شوارزينغر في استفتاء
Approaches to CLIR
  Map query and/or documents to common representation
    Document translation (DT) + pre-translation query expansion
  Query: Schwarzenegger, Schwarznegger, Schwartzenegger, ...
  Doc1: It should be mentioned that $wArznjr is also a nasseer of the Olympic Movement […]
  Doc2: … besides the star and the governor of the state of California Arnold Schwarznegger .
  Doc3: The failure of all proposals made by Schwarzenegger in a referendum
Approaches to CLIR
  Map query and/or documents to common representation
    Document translation (DT) + pre-translation query expansion
    Query translation (QT) + post-translation query expansion
  Query: Schwarzenegger, Schwarznegger, Schwartzenegger, ...
  Translated query: شفارتزنيغر، شوارزنجر، شوارزنيجر، شوارزينيجر
  Doc1: يذكر ان شوارزنجر هو ايضا نصير للحركة الأوليمبية الخاصة ...
  Doc2: ... الى جانب النجم وحاكم ولاية كاليفورنيا ارنولد شوارزنيجر .
  Doc3: فشل كل الاقتراحات التي عرضها شوارزينغر في استفتاء
Query Translation vs. Document Translation
  Trade-offs
    Translation resources
    Approximate DT [Oard 00], [Chen 04]
    Translation quality
    Handling synonymy
  Hybrid methods
    [McCarley 99], [Chen & Gey 04]: Run QT and DT searches, merge results and rerank
    [Wang & Oard 06]: Use bidirectional word alignments to capture information from QT and DT
Hybrid Merged Method
  Merge and re-rank results of two searches [McCarley 99]
    DT: Query + indexed document translations
    QT: Translated query + indexed source documents
  Problems
    Different document lengths, query lengths
    Raw IR scores not comparable across queries
    Many ways to re-rank, merge searches

  docid   DT Score   QT Score   Average
  Doc1    0          0.5        0.25
  Doc2    0.9        0.5        0.7
  Doc3    0.8        0          0.4

  Merged Results: Doc2, Doc3, Doc1
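The merge step in the table can be sketched as follows. This is only an illustration of score averaging, one of the many possible merging schemes the slide alludes to; the function name `merge_by_average` and the choice to treat a missing score as 0 are assumptions, not details taken from the talk.

```python
from typing import Dict, List


def merge_by_average(dt_scores: Dict[str, float],
                     qt_scores: Dict[str, float]) -> List[str]:
    """Rank documents by the mean of their DT and QT scores.

    A document missing from one result list gets a score of 0 there
    (an assumption made for this sketch).
    """
    doc_ids = set(dt_scores) | set(qt_scores)
    averaged = {
        d: (dt_scores.get(d, 0.0) + qt_scores.get(d, 0.0)) / 2
        for d in doc_ids
    }
    # Highest average first; ties broken by docid for determinism.
    return sorted(averaged, key=lambda d: (-averaged[d], d))


# Scores from the slide's example table.
dt = {"Doc2": 0.9, "Doc3": 0.8}
qt = {"Doc1": 0.5, "Doc2": 0.5}
merged = merge_by_average(dt, qt)
print(merged)
```

With the slide's scores, Doc2 ranks first, then Doc3, then Doc1. The sketch also makes the listed problems concrete: the average is only meaningful if the two searches produce scores on a comparable scale.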
Simultaneous Multilingual IR (SMLIR)
  Indexed document: source + document translation
  Query: original query + query translations (+expansions)

  Query: Schwarzenegger, Schwarznegger, شفارتزنيغر، شوارزنجر، شوارزنيجر، شوارزينيجر
  Doc1: يذكر ان شوارزنجر هو ايضا نصير للحركة الأوليمبية الخاصة ...
        It should be mentioned that $wArznjr is also a nasseer of the Olympic Movement […]
  Doc2: ... الى جانب النجم وحاكم ولاية كاليفورنيا ارنولد شوارزنيجر .
        … besides the star and the governor of the state of California Arnold Schwarznegger .
  Doc3: فشل كل الاقتراحات التي عرضها شوارزينغر في استفتاء
        The failure of all proposals made by Schwarzenegger in a referendum
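As a rough illustration of why one search over source text plus translation can retrieve more than either search alone, the sketch below reduces the three example documents to their name tokens. The `docs` layout and the `search` helper are hypothetical, and exact string matching stands in for a real IR engine.

```python
# Query variants as shown on the slide.
english_query = ["Schwarzenegger", "Schwarznegger"]
arabic_query = ["شفارتزنيغر", "شوارزنجر", "شوارزنيجر", "شوارزينيجر"]

# Name token found in each document's source text and in its machine
# translation (everything else is dropped for brevity).
docs = {
    "Doc1": {"source": {arabic_query[1]}, "translation": {"$wArznjr"}},
    "Doc2": {"source": {arabic_query[2]}, "translation": {"Schwarznegger"}},
    # Doc3 uses a spelling that is not among the query translations.
    "Doc3": {"source": {"شوارزينغر"}, "translation": {"Schwarzenegger"}},
}


def search(query_terms, fields):
    """Return ids of documents whose chosen fields share a term with the query."""
    terms = set(query_terms)
    hits = []
    for doc_id, doc in docs.items():
        indexed = set()
        for field in fields:
            indexed |= doc[field]
        if indexed & terms:
            hits.append(doc_id)
    return hits


dt_hits = search(english_query, ["translation"])   # document translation only
qt_hits = search(arabic_query, ["source"])         # query translation only
smlir_hits = search(english_query + arabic_query, ["source", "translation"])
print(dt_hits, qt_hits, smlir_hits)
```

In this toy setup the DT search misses Doc1 (the name was left untranslated), the QT search misses Doc3 (its spelling is not among the query translations), and the combined search finds all three.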
Simultaneous Multilingual IR (SMLIR)
  Multilingual (probabilistic) structured queries
  Treat query term and its translations as synonyms

  $$\widehat{TF}_j(w) = TF_j(w) + \sum_{x \in trans(w)} TF_j(x)$$
  $$\widehat{DF}_j(w) = DF_j(w) + \sum_{x \in trans(w)} DF_j(x)$$

SMLIR Hybrid vs. Merged Hybrid
  No need for re-ranking or raw score normalization
  Single index, one search
  Query time comparable to Merged in practice
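Treating a query term and its translations as synonyms amounts to pooling their counts. The sketch below is a minimal illustration under the assumption that the pooled term frequency and document frequency are plain sums over the term and its translations; the helper names `combined_tf` and `combined_df` and the toy collection are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def combined_tf(term: str, translations: Sequence[str],
                doc_tokens: Sequence[str]) -> int:
    """Term frequency of `term` plus that of each translation in one document."""
    counts = Counter(doc_tokens)
    return counts[term] + sum(counts[x] for x in translations)


def combined_df(term: str, translations: Sequence[str],
                collection: List[Sequence[str]]) -> int:
    """Document frequency of `term` plus that of each translation."""
    def df(t: str) -> int:
        return sum(1 for doc in collection if t in doc)
    return df(term) + sum(df(x) for x in translations)


AR_ANNAN = "عنان"  # Arabic form of the surname, used as the one translation
collection = [
    [AR_ANNAN, "kofi", "annan"],   # source token + translated tokens
    [AR_ANNAN, "visited"],         # name untranslated in the English side
    ["weather", "report"],
]

tf_doc0 = combined_tf("annan", [AR_ANNAN], collection[0])
df_all = combined_df("annan", [AR_ANNAN], collection)
print(tf_doc0, df_all)
```

Note that with summed document frequencies, a document containing both the term and a translation (the first one here) is counted twice, which is a property of the summation and not of any particular IR engine.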
Relevance: Lost in Translation
  Statistical MT makes mistakes
  Bad translations of relevant documents may be perceived as irrelevant
  Detection: IR match in source language but not in document translation → Bad translation?
  Correction: Replace bad translation with query term

  Query: Sajida al-Rishawi (ساجدة الريشاوي)
  Source document: وكانت العراقية ساجدة الريشاوي اوقفت ...
  Machine translation: It was the Iraqi sajidah Alry$Awy had stopped…
Query-Directed MT Post-Editing
  Use query translation + word alignments to rewrite incorrect machine translation (MT)
  Considerations: errors in query translation, incorrect word alignments

  Query: Sajida al-Rishawi (ساجدة الريشاوي)
  Source document: وكانت العراقية ساجدة الريشاوي اوقفت ...
  Translated document with word alignments: It was the Iraqi sajidah Alry$Awy had stopped…
  Edited translation: It was the Iraqi Sajida al-Rishawi had stopped…
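The detect-and-rewrite idea can be sketched in a few lines. This is a simplified illustration, not the system described in the talk: the `post_edit` function, the alignment format (source index mapped to target indices) and the `synonyms` set are all assumptions, and only the first match is rewritten.

```python
from typing import Dict, List, Set


def post_edit(source_tokens: List[str], target_tokens: List[str],
              alignment: Dict[int, List[int]],
              query_source: List[str], query_target: List[str],
              synonyms: Set[str]) -> List[str]:
    """Replace the translation of a matched query name with the query term.

    `alignment` maps a source token index to the target token indices it
    is aligned to. `synonyms` holds lowercased acceptable renderings.
    """
    n = len(query_source)
    edited = list(target_tokens)
    for start in range(len(source_tokens) - n + 1):
        if source_tokens[start:start + n] != query_source:
            continue
        positions = sorted({t for s in range(start, start + n)
                            for t in alignment.get(s, [])})
        if not positions:
            continue  # name deleted in translation: not handled here
        lo, hi = positions[0], positions[-1]
        if positions != list(range(lo, hi + 1)):
            continue  # be conservative: skip non-contiguous alignments
        current = " ".join(edited[lo:hi + 1]).lower()
        if current in synonyms:
            continue  # translation already matches the query
        edited[lo:hi + 1] = query_target
        break  # indices shift after an edit, so stop at the first rewrite
    return edited


name_ar = ["ساجدة", "الريشاوي"]
source = ["وكانت", "العراقية", *name_ar, "اوقفت"]
target = ["It", "was", "the", "Iraqi", "sajidah", "Alry$Awy", "had", "stopped"]
alignment = {0: [0, 1], 1: [2, 3], 2: [4], 3: [5], 4: [6, 7]}

edited = post_edit(source, target, alignment, name_ar,
                   ["Sajida", "al-Rishawi"], {"sajida al-rishawi"})
print(" ".join(edited))
```

The two early `continue` branches reflect the considerations on the slide: when the alignment is missing or scattered, the sketch leaves the translation untouched instead of guessing.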
Experiment Setup
  Part of DARPA GALE question-answering task
    WHERE HAS [UN Secretary General Kofi Annan] BEEN AND WHEN?
    Multilingual: English, Chinese, Arabic
    Multimodal: speech, text; Multigenre: formal, informal
  Evaluation Corpus
    102,859 Chinese documents
    Translated into English using RWTH statistical machine translation system
    Searches run using Indri (Lemur) IR system
  Relevance judgments
    145 queries, 8,785 documents judged
    A document is Relevant or Not Relevant for a query
    Judgments based on Chinese text, by Chinese native speakers
Evaluation Points
1. Query Translation Strategies
     English query → Chinese query
     Run SMLIR searches, evaluate results
2. Cross-lingual IR Approaches
     Using Chinese and/or English query, search over Chinese and/or translated documents
3. Machine Translation Post-Editing
     Detect errors in result translations
     Rewrite translations
Query Translation for SMLIR
  GALE queries are name-centric
  Statistical machine translation (SMT) failed to translate many names in corpus
  Wikipedia for name translation [Ferrandez et al. 07]
    Generated by humans, “edited” by humans
    Contains slang, name variations, common misspellings
    Noisy, some intentional spam
    Large variation in quantity/quality by language
User-Generated “Synonyms” and Translations

  English query: mahmoud abbas
    English redirects: abu mazen; mahmud 'abbas; mahmud abbas; abbas, mahmoud
    Cross-language link: محمود عباس
    Arabic redirects: محمود عباس; أبو مازن

  English query: kofi annan
    English redirects: annan, kofi; kofi; kofi a annan; kofi atta annan; kofi bo bofi; nana maria annan
    Cross-language link: كوفي عنان
    Arabic redirects: كوفي عنان; كوفي انان; كوفي أنان

  English query: arnold schwarzenegger
    English redirects (49 variants): ahnuld; governator; arnold swarzenager; arnold swarzenneger; arnold swartzeneger; …
    Cross-language link: آرنولد شوارزنيجر
    Arabic redirects: آرنولد شوارزنيجر
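The way redirects and cross-language links combine into one expanded query can be sketched as a pair of dictionary lookups. The `redirects` and `cross_links` tables below are hand-built from the Kofi Annan row and are hypothetical stand-ins for data extracted from Wikipedia; `expand_query` is an illustrative helper, not part of any library.

```python
from typing import Dict, List


def expand_query(query: str,
                 redirects: Dict[str, List[str]],
                 cross_links: Dict[str, str]) -> List[str]:
    """Collect the query, its redirects, its cross-language title,
    and that title's redirects, without duplicates and in that order."""
    variants = [query] + redirects.get(query, [])
    foreign_title = cross_links.get(query)
    if foreign_title is not None:
        variants += [foreign_title] + redirects.get(foreign_title, [])
    seen = set()
    unique = []
    for v in variants:
        if v not in seen:
            seen.add(v)
            unique.append(v)
    return unique


AR_TITLE = "كوفي عنان"
redirects = {
    "kofi annan": ["annan, kofi", "kofi a annan", "kofi atta annan"],
    AR_TITLE: [AR_TITLE, "كوفي انان", "كوفي أنان"],
}
cross_links = {"kofi annan": AR_TITLE}

expanded = expand_query("kofi annan", redirects, cross_links)
print(len(expanded), expanded)
```

Because the tables are user-generated, a real pipeline would likely need some filtering of spam and off-topic redirects before using them as query synonyms; that step is left out here.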
Query Translation Strategies for SMLIR
[Bar chart: NDCG at 10 (y-axis, 0.30 to 0.60) for three query translation strategies: MT Dictionary, Wikipedia, Wikipedia + MT Dictionary]
MT dictionary: probabilistic translation dictionary derived from word alignments
Wikipedia: for name translations; not probabilistic
Combination did not help?
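For readers unfamiliar with the metric on the chart, a common formulation of NDCG at 10 with binary relevance is sketched below. The talk does not state which exact variant was used, so the log2 discount and the way the ideal ranking is built from `total_relevant` are assumptions.

```python
import math
from typing import Sequence


def dcg(gains: Sequence[int]) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranked_relevance: Sequence[int], total_relevant: int,
              k: int = 10) -> float:
    """NDCG at cutoff k for binary judgments (1 = Relevant, 0 = Not Relevant).

    `total_relevant` is the number of relevant documents known for the
    query; it determines the ideal ranking.
    """
    ideal = [1] * min(total_relevant, k)
    ideal_dcg = dcg(ideal)
    if ideal_dcg == 0:
        return 0.0
    return dcg(ranked_relevance[:k]) / ideal_dcg


perfect = ndcg_at_k([1, 1, 0], total_relevant=2)
late_hit = ndcg_at_k([0, 1], total_relevant=1)
print(perfect, late_hit)
```

A ranking that places every relevant document first scores 1.0, and the score drops as relevant documents move down the top 10.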
CLIR Evaluation
[Bar chart: NDCG at 10 (y-axis, 0.30 to 0.60) for four CLIR approaches: Query Translation, Document Translation, Merged Hybrid, SMLIR Hybrid]
SMLIR significantly outperforms all
DT significantly better than QT
Poor performance of QT degrades Merged
Results: Query-Directed SMT Post-Editing
  Post-Editing
    Detect possible incorrect name translations
    If translated name is not a synonym of query, rewrite name
    Very conservative algorithm; does not handle deletions
  Experiment
    127 queries, top 10 documents
    28 queries triggered post-editing
    15% of name matches were rewritten
  Evaluation
    101 rewrites examined; 93% Acceptable, 6% Not Acceptable
Conclusions
SMLIR: Novel and effective approach for integrating document and query translation in CLIR
Query-directed SMT post-editing shows promise
  More sophisticated editing possible, beyond just names
Future work: evaluating whole system for end-to-end question answering
Combining CLIR and machine translation can improve both search relevance and translation accuracy
Thank you!
This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract number HR0011-06-C-0023, in part by an NSF Graduate Research Fellowship, and in part by the Center for Intelligent Information Retrieval at the University of Massachusetts.
Thanks very much to Bob Armstrong for making the annotation happen. Thanks also to Mark Smucker and Giridhar Kumaran for help with INDRI interface and corpus issues, and Ben Carterette for help with estimated MAP. We would also like to thank the members of the NIGHTINGALE machine translation team for translation data, especially Nizar Habash and Mahmoud Ghoneim.