Gina-Anne Levow University of Chicago July 7, 2003


Issues in Pre- and Post-translation Document Expansion:

Untranslatable Cognates and Missegmented Words

Gina-Anne Levow

University of Chicago

July 7, 2003

Roadmap

• Goals of expansion
  – Expansion points in CL-SDR
• Pre- and Post-translation document expansion experiments
  – Task, query & document processing
  – Expansion methodology

• Results

• Discussion & Conclusions

Why Expansion?

• Recover terms that could have appeared
  – Compensate for difference in term choice
    • Author concepts vs searcher information need
  – Compensate for noisy processing
    • ASR transcription errors
      – Misrecognitions, deletions, missegmentations
    • Translation errors
      – Gaps, missegmentations
  – Context disambiguates

Expansion Opportunities

• Query:
  – (Ballesteros & Croft '96; McNamee & Mayfield 2002)
  – Before, after translation; both
  – Different enhancements to precision/recall
  – Pre-translation key: something to translate
    • European languages
• Document
  – Before, after translation; both
  – Developed for monolingual SDR (Singhal 1999)
  – CLIR (+SDR) (Levow & Oard 2000)
    • Post-translation promising

Experimental Configuration: Basic Task

• Variant of Topic Detection and Tracking (TDT)
  – English queries to Mandarin documents
    • Query-by-example
      – English newswire or broadcast news stories
    • Mandarin audio broadcast news documents
      – Automatically transcribed by Dragon ASR system
  – Modifications:
    • Retrospective retrieval
    • Evaluation metric: Mean Average Precision
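The evaluation metric named above can be sketched as follows. This is a generic uninterpolated formulation of mean average precision, not the scoring code used in these experiments; the function names and the (ranking, relevant set) input shape are assumptions made for illustration.

```python
from typing import Hashable, Iterable, Sequence, Set, Tuple


def average_precision(ranking: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Uninterpolated average precision for a single query."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank of each relevant document.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Iterable[Tuple[Sequence[Hashable], Set[Hashable]]]
) -> float:
    """Mean of per-query average precision over (ranking, relevant) pairs."""
    scores = [average_precision(ranking, relevant) for ranking, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0
```

Averaging per query, rather than pooling all documents, gives each query equal weight regardless of how many relevant documents it has.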

Experimental Configuration: Query and Document Processing

• Query:
  – Select top 180 positively correlated terms in 4 exemplars
    • Based on χ² test
    • 996 prior documents assumed not relevant
• Document:
  – Dictionary-based word-for-word translation
    • Segmentation: NMSU ch_seg
    • Translation resource:
      – Merged bilingual term list: CETA & LDC term list
    • Translation ranking:
      – Target language unigram frequency: single words, multi-word
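The query term selection step above can be sketched as a χ² test on a 2x2 contingency table per term. The slides do not say how the table was built, so this sketch assumes document-level presence counts, with the exemplars as the positive class and the prior documents as the negative class; `chi_square` and `select_query_terms` are hypothetical names.

```python
from typing import List, Set


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic for a 2x2 table.

    a: exemplars containing the term    b: prior documents containing it
    c: exemplars lacking the term       d: prior documents lacking it
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_query_terms(
    exemplars: List[Set[str]], background: List[Set[str]], k: int = 180
) -> List[str]:
    """Return up to k exemplar terms, ranked by chi-square score."""
    n_pos, n_neg = len(exemplars), len(background)
    scored = []
    for term in set().union(*exemplars):
        a = sum(term in doc for doc in exemplars)
        b = sum(term in doc for doc in background)
        c, d = n_pos - a, n_neg - b
        # Keep only terms positively correlated with the exemplars.
        if a * d <= b * c:
            continue
        scored.append((chi_square(a, b, c, d), term))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:k]]
```

With 4 exemplars and 996 prior documents, a term that occurs in most exemplars but rarely in the prior documents scores highest, which is the behavior the selection step relies on.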

Experimental Configuration: Document Expansion

Document Expansion: Details

• Side collections:
  – Mandarin: TDT-2 Xinhua, Zaobao newswire
  – English: TDT-2 New York Times, AP news
• Expansion term selection
  – Top 5 documents
  – Sort candidate terms by idf
  – Exclude terms in only one document
  – Add one term instance per document
  – Add until document doubled in length
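The term selection steps above can be sketched as follows. This is one possible reading, not the original implementation: it assumes the side-collection retrieval has already produced a ranked list of documents, that idf is sorted highest first, and that "one term instance per document" means one added instance for each top-ranked document containing the term. `expand_document` and its parameters are hypothetical names.

```python
from typing import Dict, List, Sequence


def expand_document(
    doc_terms: Sequence[str],
    top_docs: Sequence[Sequence[str]],
    idf: Dict[str, float],
    max_docs: int = 5,
) -> List[str]:
    """Append expansion terms until the document has doubled in length."""
    top = [set(doc) for doc in top_docs[:max_docs]]

    # Count how many of the top-ranked documents contain each term.
    doc_counts: Dict[str, int] = {}
    for doc in top:
        for term in doc:
            doc_counts[term] = doc_counts.get(term, 0) + 1

    # Exclude terms found in only one document, then sort by idf (assumed descending).
    candidates = [term for term, count in doc_counts.items() if count > 1]
    candidates.sort(key=lambda term: (-idf.get(term, 0.0), term))

    expanded = list(doc_terms)
    limit = 2 * len(doc_terms)
    for term in candidates:
        # One added instance per top-ranked document containing the term.
        for _ in range(doc_counts[term]):
            if len(expanded) >= limit:
                return expanded
            expanded.append(term)
    return expanded
```

Requiring a term to appear in at least two of the top documents filters out terms that are idiosyncratic to a single retrieved story, while the length cap keeps the added terms from swamping the original content.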

Results

• Post-translation significantly outperforms pre-translation expansion

Expansion                None   Pre    Post   Pre+Post
Mean Average Precision   0.39   0.46   0.59   0.61
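To make the size of these differences concrete, the relative gain of each condition over the unexpanded baseline can be computed directly from the four scores above. This is simple arithmetic on the reported values, offered as an illustration; it says nothing about the significance testing behind the claim.

```python
# Scores copied from the results above.
baseline = 0.39
scores = {"Pre": 0.46, "Post": 0.59, "Pre+Post": 0.61}

# Relative improvement over the no-expansion baseline.
relative_gain = {
    name: (value - baseline) / baseline for name, value in scores.items()
}

for name, gain in relative_gain.items():
    print(f"{name}: {gain:+.1%}")
```

Pre-translation expansion yields a gain of roughly 18% over the baseline, post-translation expansion roughly 51%, and the combination roughly 56%.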

Discussion: Post-translation Effectiveness

• Post-translation document expansion significantly improves retrieval effectiveness
  – Little improvement from pre-translation expansion
    • Either alone or in conjunction
• Expansion introduces key enriching terms
  – Named entities, alternate forms
    • E.g. Tariq Aziz, Saddam, Yeltsin, etc.
  – Available in English (post-translation) collection

Discussion: Pre-translation Limitations

• Expansion terms do not exist
  – Segmentation & transcription rely on term lists
    • Named entities frequently absent
    • Cannot extract terms from Mandarin newswire
• Expansion terms cannot translate
  – Key terms (e.g. named entities) absent from bilingual term lists
    • All of the examples above (Tariq Aziz, Saddam, Yeltsin) absent

Discussion: Contrasts

• Contradict prior query expansion results
  – Re: Primacy of pre-translation expansion
• Explanation:
  – Prior languages: mostly European
    • Common writing system, white-space delimited
    • Pre-translation expansion produces
      – Translatable terms + (possibly) untranslatable cognates
      – Cognates still match, even without translation
  – Current experiment: English-Mandarin
    • Untranslatable cognates useless
      – Different orthography
    • Terms not identified: missegmentation
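The contrast above can be illustrated with a small sketch of word-for-word translation in which out-of-vocabulary terms pass through unchanged. All data here is hypothetical toy data: two tiny lexicons that both lack the named entity, a Spanish document standing in for the European case, and a correctly segmented Mandarin document.

```python
from typing import Dict, List, Set


def translate_with_passthrough(tokens: List[str], lexicon: Dict[str, str]) -> List[str]:
    """Word-for-word translation; terms missing from the lexicon are kept as-is."""
    return [lexicon.get(token, token) for token in tokens]


def matching_terms(translated: List[str], query: Set[str]) -> Set[str]:
    """Terms shared by the translated document and the query."""
    return set(translated) & query


# Hypothetical toy data: neither lexicon contains the named entity.
query = {"saddam", "iraq"}
spanish_lexicon = {"irak": "iraq"}
mandarin_lexicon = {"伊拉克": "iraq"}

spanish_doc = ["saddam", "irak"]
mandarin_doc = ["萨达姆", "伊拉克"]

# Shared writing system: the untranslated name still matches the query.
spanish_matches = matching_terms(
    translate_with_passthrough(spanish_doc, spanish_lexicon), query
)
# Different orthography: the untranslated name matches nothing.
mandarin_matches = matching_terms(
    translate_with_passthrough(mandarin_doc, mandarin_lexicon), query
)
```

In the toy Spanish case both query terms match, while in the Mandarin case only the dictionary-covered term does. The sketch also assumes the Mandarin name was segmented as a single token; with missegmentation even that pass-through form would not be available.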

Conclusion

• Document expansion improves effectiveness
  – For CL-SDR case, recovers terms lost by missegmentation, mistranscription, or mistranslation; supports differences in term choice
• Post-translation expansion most effective
  – Translated terms provide context for retrieval
    • Correct translations/transcriptions coherent; others noise
  – Enriching terms often absent from term lists
    • Segmentation, transcription, translation all rely on lists
  – Expansion in indexing language bypasses barriers
    • Crucial in languages with segmentation issues and different forms