Gina-Anne Levow University of Chicago July 7, 2003
Issues in Pre- and Post-translation Document Expansion:
Untranslatable Cognates and Missegmented Words
Gina-Anne Levow
University of Chicago
July 7, 2003
Roadmap
• Goals of expansion
  – Expansion points in CL-SDR
• Pre- and Post-translation document expansion experiments
  – Task, query & document processing
  – Expansion methodology
• Results
• Discussion & Conclusions
Why Expansion?
• Recover terms that could have appeared
  – Compensate for difference in term choice
    • Author concepts vs searcher information need
  – Compensate for noisy processing
    • ASR transcription errors
      – Misrecognitions, deletions, missegmentations
    • Translation errors
      – Gaps, missegmentations
  – Context disambiguates
Expansion Opportunities
• Query (Ballesteros & Croft ’96; McNamee & Mayfield 2002)
  – Before, after translation; both
  – Different enhancements to precision/recall
  – Pre-translation key: something to translate
    • European languages
• Document
  – Before, after translation; both
  – Developed for monolingual SDR (Singhal 1999)
  – CLIR (+SDR) (Levow & Oard 2000)
    • Post-translation promising
Experimental Configuration: Basic Task
• Variant of Topic Detection and Tracking (TDT)
  – English queries to Mandarin documents
    • Query-by-example
      – English newswire or broadcast news stories
    • Mandarin audio broadcast news documents
      – Automatically transcribed by Dragon ASR system
  – Modifications:
    • Retrospective retrieval
    • Evaluation metric: Mean Average Precision
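For reference, Mean Average Precision can be computed as in the minimal sketch below. It uses the standard definition of the metric, not the evaluation code from these experiments. The function names and the input layout (a ranked list of document ids plus a set of relevant ids per query) are assumptions made for illustration.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query.

    Averages precision@k over the ranks k at which a relevant
    document is retrieved, divided by the total number of
    relevant documents.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over queries.

    runs: list of (ranked_ids, relevant_ids) pairs, one per query
    (a hypothetical input layout chosen for this sketch).
    """
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For example, a ranking `["d1", "d2", "d3", "d4"]` with relevant set `{"d1", "d3"}` scores (1/1 + 2/3) / 2, so a system that ranks relevant stories higher is rewarded more than one that merely retrieves them.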
Experimental Configuration: Query and Document Processing
• Query:
  – Select top 180 positively correlated terms in 4 exemplars
    • Based on χ² test
    • 996 prior documents assumed not relevant
• Document:
  – Dictionary-based word-for-word translation
    • Segmentation: NMSU ch_seg
    • Translation resource:
      – Merged bilingual term list: CETA & LDC term list
    • Translation ranking:
      – Target language unigram frequency: single words, multi-word
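The query term selection step above can be sketched as follows. This is a minimal illustration that assumes a standard 2x2 χ² statistic over document counts (exemplar vs. assumed non-relevant, term present vs. absent); the slides do not give the exact formulation used. The function names and the token-list input format are hypothetical.

```python
from collections import Counter


def chi_squared(a, b, c, d):
    """Chi-squared statistic for a 2x2 contingency table.

    a: exemplars containing the term     b: exemplars without it
    c: background docs containing it     d: background docs without it
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_query_terms(exemplars, background, top_k=180):
    """Pick the top_k terms positively correlated with the exemplars.

    exemplars, background: lists of token lists (assumed layout).
    """
    ex_df = Counter(t for doc in exemplars for t in set(doc))
    bg_df = Counter(t for doc in background for t in set(doc))
    n_ex, n_bg = len(exemplars), len(background)
    scored = []
    for term, a in ex_df.items():
        c = bg_df.get(term, 0)
        b, d = n_ex - a, n_bg - c
        # Keep only terms over-represented in the exemplars.
        if a * d > b * c:
            scored.append((chi_squared(a, b, c, d), term))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:top_k]]
```

In the configuration described on the slide, `exemplars` would hold the 4 example stories and `background` the 996 prior documents assumed not relevant.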
Experimental Configuration: Document Expansion
Document Expansion: Details
• Side collections:
  – Mandarin: TDT-2 Xinhua, Zaobao newswire
  – English: TDT-2 New York Times, AP news
• Expansion term selection
  – Top 5 documents
  – Sort candidate terms by idf
  – Exclude terms in only one document
  – Add one term instance per document
  – Add until document doubled in length
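The expansion term selection steps above can be sketched roughly as below. Several details are assumptions rather than facts from the slides: the top-ranked side-collection documents are taken as already retrieved, "one term instance per document" is read as one added instance for each retrieved document containing the term, and idf is computed as log(N / df). All names are hypothetical.

```python
import math
from collections import Counter


def expand_document(doc_tokens, top_docs, doc_freq, n_collection, n_top=5):
    """Append expansion terms to a document until it doubles in length.

    doc_tokens:   tokens of the document being expanded
    top_docs:     token lists of side-collection documents, best first
    doc_freq:     term -> document frequency in the side collection
    n_collection: number of documents in the side collection
    """
    top_docs = top_docs[:n_top]
    # How many of the top documents contain each term.
    support = Counter(t for doc in top_docs for t in set(doc))
    # Exclude terms that occur in only one retrieved document.
    candidates = [t for t, n in support.items() if n > 1]

    def idf(term):
        # Assumed idf form; unseen terms are treated as df = 1.
        return math.log(n_collection / max(doc_freq.get(term, 1), 1))

    # Highest idf first; ties broken alphabetically for determinism.
    candidates.sort(key=lambda t: (-idf(t), t))

    expanded = list(doc_tokens)
    limit = 2 * len(doc_tokens)
    for term in candidates:
        # One instance per retrieved document containing the term.
        for _ in range(support[term]):
            if len(expanded) >= limit:
                return expanded
            expanded.append(term)
    return expanded
```

Under this reading, a term shared by more of the retrieved documents receives more weight in the expanded document, while the length cap keeps the original content from being swamped.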
Results
• Post-translation significantly outperforms pre-translation expansion

Expansion                 None   Pre    Post   Pre+Post
Mean Average Precision    0.39   0.46   0.59   0.61
Discussion: Post-translation Effectiveness
• Post-translation document expansion significantly improves retrieval effectiveness
  – Little improvement from pre-translation expansion
    • Either alone or in conjunction
• Expansion introduces key enriching terms
  – Named entities, alternate forms
    • E.g. Tariq Aziz, Saddam, Yeltsin, etc.
  – Available in English (post-translation) collection
Discussion: Pre-translation Limitations
• Expansion terms do not exist
  – Segmentation & transcription rely on term lists
    • Named entities frequently absent
    • Cannot extract terms from Mandarin newswire
• Expansion terms cannot translate
  – Key terms (e.g. named entities) absent from bilingual term lists
    • All examples on previous page absent
Discussion: Contrasts
• Contradict prior query expansion results
  – Re: Primacy of pre-translation expansion
• Explanation:
  – Prior languages: mostly European
    • Common writing system, white-space delimited
    • Pre-translation expansion produces
      – -> translatable terms + (possibly) untranslatable cognates
      – Cognates still match, even without translation
  – Current experiment: English-Mandarin
    • Untranslatable cognates useless
      – Different orthography
    • Terms not identified - missegmentation
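The contrast above can be illustrated with a toy sketch of word-for-word translation in which terms missing from the term list pass through unchanged. The term lists, documents, and function names below are made up for illustration and are not taken from the experiments; the Mandarin document is also assumed to be correctly segmented, which the slides note is often not the case.

```python
def translate_word_for_word(tokens, term_list):
    """Translate each token via the term list.

    Tokens absent from the term list pass through unchanged.
    """
    return [term_list.get(tok, tok) for tok in tokens]


def matching_terms(query_tokens, doc_tokens):
    """Terms shared by the query and the (translated) document."""
    return set(query_tokens) & set(doc_tokens)


# Toy term lists (hypothetical); neither contains the named entity.
es_en = {"presidente": "president", "visita": "visit"}
zh_en = {"总统": "president", "访问": "visit"}

query = ["president", "yeltsin", "visit"]
spanish_doc = ["presidente", "yeltsin", "visita"]
mandarin_doc = ["总统", "叶利钦", "访问"]

# Shared writing system: the untranslated name still matches.
spanish_hits = matching_terms(query, translate_word_for_word(spanish_doc, es_en))

# Different orthography: the untranslated name cannot match.
mandarin_hits = matching_terms(query, translate_word_for_word(mandarin_doc, zh_en))
```

Here `spanish_hits` includes "yeltsin" while `mandarin_hits` does not, even though both term lists have the same gap.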
Conclusion
• Document expansion improves effectiveness
  – For CL-SDR case, recovers terms lost by missegmentation, mistranscription, or mistranslation; supports different terms
• Post-translation expansion most effective
  – Translated terms provide context for retrieval
    • Correct translations/transcriptions coherent; others noise
  – Enriching terms often absent from term lists
    • Segmentation, transcription, translation all rely on lists
  – Expansion in indexing language bypasses barriers
    • Crucial in languages with segmentation issues and different forms