Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.


Page 1: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Cross-Language Retrieval

LBSC 796/INFM 718R

Douglas W. Oard

Session 12: November 26, 2007

Page 2: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Agenda

• Questions

• Overview

• Cross-Language Search

• User Interaction

Page 3: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

User Needs Assessment

• Who are the potential users?

• What goals do we seek to support?

• What language skills must we accommodate?

Page 4: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Global Internet Users

Native speakers, Global Reach projection for 2004 (as of Sept. 2003)

• English: 31%
• Chinese: 18%
• Japanese: 9%
• Spanish: 7%
• German: 7%
• Korean: 5%
• French: 4%
• Portuguese: 3%
• Italian: 3%
• Russian: 2%
• Other: 11%

Page 5: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Global Internet Users

Native speakers, Global Reach projection for 2004 (as of Sept. 2003): English 31%, Chinese 18%, Japanese 9%, Spanish 7%, German 7%, Korean 5%, French 4%, Portuguese 3%, Italian 3%, Russian 2%, Other 11%.

[Second pie chart, title and segment labels not preserved. Segment values: 68%, 4%, 6%, 2%, 6%, 1%, 3%, 1%, 2%, 2%, 5%.]

Page 6: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Most Widely-Spoken Languages

[Bar chart: number of speakers (millions, axis 0 to 1000), split into primary and secondary speakers, for Chinese, English, Spanish, Russian, French, Portuguese, Arabic, Bengali, Hindi/Urdu, Japanese, and German.]

Source: Ethnologue (SIL), 1999

Page 7: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Global Trade

[Bar chart: exports and imports in billions of US dollars (1999, axis 0 to 1000) for USA, Germany, Japan, China, France, UK, Canada, Italy, Netherlands, Belgium, Korea, Mexico, Taiwan, Singapore, and Spain.]

Source: World Trade Organization 2000 Annual Report

Page 8: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Who needs Cross-Language Search?

• When users can read several languages
  – Eliminate multiple queries
  – Query in most fluent language
• Monolingual users can also benefit
  – If translations can be provided
  – If it suffices to know that a document exists
  – If text captions are used to search for images

Page 9: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

The Problem Space

• Retrospective search
  – Web search
  – Specialized services (medicine, law, patents)
  – Help desks
• Real-time filtering
  – Email spam
  – Web parental control
  – News personalization
• Real-time interaction
  – Instant messaging
  – Chat rooms
  – Teleconferences

Key Capabilities

• Map across languages
  – For human understanding
  – For automated processing

Page 10: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

A Little (Confusing) Vocabulary

• Multilingual document
  – Document containing more than one language
• Multilingual collection
  – Collection of documents in different languages
• Multilingual system
  – Can retrieve from a multilingual collection
• Cross-language system
  – Query in one language finds document in another
• Translingual system
  – Queries can find documents in any language

Page 11: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

The Information Retrieval Cycle

[Diagram: Source Selection → (Resource) → Query Formulation → (Query) → Search → (Ranked List) → Selection → (Documents) → Examination → (Documents) → Delivery, with feedback loops for source reselection and for system discovery, vocabulary discovery, concept discovery, and document discovery.]

If you can’t understand the documents…

• How do you formulate a query?
• How do you know something is worth looking at?
• How can you understand the retrieved documents?

Page 12: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

[Diagram labels: Translingual Search, Translingual Browsing, Translation; Query, Document; Select, Examine; Information Access, Information Use]

Page 13: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Early Work

• 1964 International Road Research
  – Multilingual thesauri
• 1970 SMART
  – Dictionary-based free-text cross-language retrieval
• 1978 ISO Standard 5964 (revised 1985)
  – Guidelines for developing multilingual thesauri
• 1990 Latent Semantic Indexing
  – Corpus-based free-text translingual retrieval

Page 14: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Multilingual Thesauri

• Build a cross-cultural knowledge structure
  – Cultural differences influence indexing choices
• Use language-independent descriptors
  – Matched to language-specific lead-in vocabulary
• Three construction techniques
  – Build it from scratch
  – Translate an existing thesaurus
  – Merge monolingual thesauri

Page 15: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Cross-Language Retrieval

Indexing Languages

Machine-Assisted Indexing

Information Retrieval

Multilingual Metadata

Digital Libraries

International Information Flow

Diffusion of Innovation

Information Use

Automatic Abstracting

Information Science

Machine Translation

Information Extraction

Text Summarization

Natural Language Processing

Multilingual Ontologies

Ontological Engineering

Textual Data Mining

Knowledge Discovery

Machine Learning

Artificial Intelligence

Localization

Information Visualization

Human-Computer Interaction

Web Internationalization

World-Wide Web

Topic Detection and Tracking

Speech Processing

Multilingual OCR

Document Image Understanding

Other Fields

Multilingual Information Access

Page 16: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Free Text CLIR

• What to translate?
  – Queries or documents
• Where to get translation knowledge?
  – Dictionary or corpus

• How to use it?

Page 17: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

The Search Process

[Diagram. Author: Choose Document-Language Terms → Document. Monolingual Searcher: Choose Document-Language Terms → Query. Cross-Language Searcher: Choose Query-Language Terms → Query. Matching side: Infer Concepts, Select Document-Language Terms, Query-Document Matching.]

Page 18: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Translingual Retrieval Architecture

[Diagram labels: Language Identification; English Term Selection; Chinese Term Selection; Cross-Language Retrieval (results 3: 0.91, 4: 0.57, 5: 0.36); Monolingual Chinese Retrieval (results 1: 0.72, 2: 0.48); Chinese Query; Chinese Term Selection]

Page 19: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Evidence for Language Identification

• Metadata
  – Included in HTTP and HTML
• Word-scale features
  – Which dictionary gets the most hits?
• Subword features
  – Character n-gram statistics
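The subword option can be sketched in a few lines of Python. This is a minimal illustration, not a production identifier: the two training sentences, the trigram size, and the names `char_ngrams` and `identify_language` are all assumptions made for the example.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Count overlapping character n-grams, with a space padded on each side."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Tiny made-up training samples; a real system would use far more text.
SAMPLES = {
    "english": "the discussion around the planned tax reform continues "
               "and the experts came out in favor of the overhaul",
    "german": "die diskussion um die vorgesehene grosse steuerreform "
              "dauert an und der experte sprach sich heute dafuer aus",
}
PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}

def identify_language(text):
    """Return the language whose n-gram profile is most similar to the text."""
    query = char_ngrams(text)
    scores = {lang: cosine(query, profile) for lang, profile in PROFILES.items()}
    return max(scores, key=scores.get)
```

For example, `identify_language("die diskussion dauert an")` selects the German profile. Short inputs and closely related languages would need much larger profiles than this toy setup.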

Page 20: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Query-Language IR

[Diagram labels: English queries; Chinese Document Collection; Translation System; English Document Collection; Retrieval Engine; Results; select, examine]

Page 21: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Example: Modular use of MT

• Select a single query language

• Translate every document into that language

• Perform monolingual retrieval

Page 22: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

TDT-3 Mandarin Broadcast News

Systran

Balanced 2-best translation

Is Machine Translation Enough?

Page 23: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Document-Language IR

[Diagram labels: English queries; Translation System; Chinese queries; Chinese Document Collection; Retrieval Engine; Chinese documents; Results; select, examine]

Page 24: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Query vs. Document Translation

• Query translation
  – Efficient for short queries (not relevance feedback)
  – Limited context for ambiguous query terms
• Document translation
  – Rapid support for interactive selection
  – Need only be done once (if query language is same)
• Merged query and document translation
  – Can produce better effectiveness than either alone

Page 25: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Interlingual Retrieval

[Diagram labels: Chinese Query Terms; Query Translation; English Document Terms; Document Translation; Interlingual Retrieval (results 3: 0.91, 4: 0.57, 5: 0.36)]

Page 26: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Learning From Document Pairs

Columns: English Terms E1 E2 E3 E4 E5, Spanish Terms S1 S2 S3 S4

Counts as listed on the slide:

Doc 1: 4 2 2 1
Doc 2: 8 4 4 2
Doc 3: 2 2 2 1
Doc 4: 2 1 2 1
Doc 5: 4 1 2 1

Page 27: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Generalized Vector Space Model

• “Term space” of each language is different
  – Document links define a common “document space”
• Describe documents based on the corpus
  – Vector of similarities to each corpus document
• Compute cosine similarity in document space
• Very effective in a within-domain evaluation
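A rough sketch of the document-space idea, assuming a tiny made-up set of English/Spanish training pairs and plain bag-of-words cosine similarity. The names `TRAINING_PAIRS`, `to_document_space`, and `cross_language_similarity` are hypothetical; a real system would use a large document-aligned corpus and weighted terms.

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up document-linked training corpus: (English side, Spanish side).
TRAINING_PAIRS = [
    ("tax reform plan", "plan de reforma fiscal"),
    ("oil price survey", "encuesta de precios del petroleo"),
    ("bank director resigns", "director del banco dimite"),
]

def to_document_space(text, side):
    """Describe a text by its similarity to each training document.

    side=0 compares against the English halves, side=1 against the Spanish.
    """
    bag = Counter(text.lower().split())
    return {i: cosine(bag, Counter(pair[side].split()))
            for i, pair in enumerate(TRAINING_PAIRS)}

def cross_language_similarity(english_text, spanish_text):
    """Compare an English text and a Spanish text in the shared document space."""
    return cosine(to_document_space(english_text, 0),
                  to_document_space(spanish_text, 1))
```

With these pairs, `cross_language_similarity("oil survey", "precios del petroleo")` is high, while the same query against `"reforma fiscal"` scores zero, even though the query and documents share no terms.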

Page 28: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Latent Semantic Indexing

• Cosine similarity captures noise with signal
  – Term choice variation and word sense ambiguity
• Signal-preserving dimensionality reduction
  – Conflates terms with similar usage patterns
• Reduces term choice effect, even across languages
• Computationally expensive

Page 29: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

oil, petroleum

probe, survey, take samples

Which translation?

No translation!

restrain, oil, petroleum

probe, survey, take samples

cymbidium goeringii

Wrong segmentation

Page 30: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

What’s a “Term?”

• Granularity of a “term” depends on the task
  – Long for translation, more fine-grained for retrieval
• Phrases improve translation two ways
  – Less ambiguous than single words
  – Idiomatic expressions translate as a single concept
• Three ways to identify phrases
  – Semantic (e.g., appears in a dictionary)
  – Syntactic (e.g., parse as a noun phrase)
  – Co-occurrence (appear together unexpectedly often)

Page 31: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Learning to Translate

• Lexicons
  – Phrase books, bilingual dictionaries, …
• Large text collections
  – Translations (“parallel”)
  – Similar topics (“comparable”)
• Similarity
  – Similar pronunciation
• People

Page 32: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Types of Lexical Resources

• Ontology
  – Organization of knowledge
• Thesaurus
  – Ontology specialized to support search
• Dictionary
  – Rich word list, designed for use by people
• Lexicon
  – Rich word list, designed for use by a machine
• Bilingual term list
  – Pairs of translation-equivalent terms

Page 33: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Original query: El Nino and infectious diseases

Term selection: “El Nino” infectious diseases

Term translation: (Dictionary coverage: “El Nino” is not found)

Translation selection:

Query formulation:Structure:

Dictionary-Based Query Translation

Page 34: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Four-Stage Backoff

• Tralex (translation lexicon) might contain stems, surface forms, or some combination of the two.

Stage | Document term | Translation lexicon entry | Translation
1. surface form / surface form | mangez | mangez | eat
2. stem / surface form | mangez (stem: mange) | mange | eats, eat
3. surface form / stem | mangez | mange (stem: mange) | eat
4. stem / stem | mangez (stem: mange) | mangent (stem: mange) | eat

French stemmer: Oard, Levow, and Cabezas (2001); English: Inquery’s kstem
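The four stages can be sketched as a lookup that tries progressively looser matches. This is an illustrative sketch only: `toy_stem` is a hypothetical suffix stripper standing in for a real French stemmer (it yields "mang" rather than "mange"), and the lexicon is a plain dict from source terms to lists of translations.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few French verb endings."""
    for suffix in ("ent", "ez", "es", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def backoff_translate(term, lexicon, stem=toy_stem):
    """Return translations for a document term using four-stage backoff."""
    # Stage 1: surface form of the term against surface forms in the lexicon.
    if term in lexicon:
        return list(lexicon[term])
    # Stage 2: stem of the term against surface forms in the lexicon.
    stemmed_term = stem(term)
    if stemmed_term in lexicon:
        return list(lexicon[stemmed_term])
    # Build a view of the lexicon keyed by the stems of its entries.
    stemmed_lexicon = {}
    for source, translations in lexicon.items():
        stemmed_lexicon.setdefault(stem(source), []).extend(translations)
    # Stage 3: surface form of the term against stemmed lexicon entries.
    if term in stemmed_lexicon:
        return list(dict.fromkeys(stemmed_lexicon[term]))
    # Stage 4: stem of the term against stemmed lexicon entries.
    return list(dict.fromkeys(stemmed_lexicon.get(stemmed_term, [])))
```

With a lexicon containing only `{"mangent": ["eat"]}`, the term `mangez` falls through to stage 4 and still comes back as `["eat"]`. Earlier stages take priority because exact matches are less likely to conflate unrelated words.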

Page 35: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Results

Condition | Mean Average Precision
STRAND corpus tralex (N=1) | 0.2320
STRAND corpus tralex (N=2) | 0.2440
STRAND corpus tralex (N=3) | 0.2499
Merging by voting | 0.2892
Baseline: downloaded dictionary | 0.2919
Backoff from dictionary to corpus tralex | 0.3282

Backoff vs. baseline: +12% relative (p < .01)

Page 36: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Results Detail

mAP

Page 37: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Exploiting Part-of-Speech (POS)

• Constrain translations by part-of-speech
  – Requires POS tagger and POS-tagged lexicon
• Works well when queries are full sentences
  – Short queries provide little basis for tagging
• Constrained matching can hurt monolingual IR
  – Nouns in queries often match verbs in documents

Page 38: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

The Short Query Challenge

[Bar chart: number of terms per query (axis 0 to 3) in 1997, 1998, and 1999, for English and for other European languages (German, French, Italian, Dutch, Swedish).]

Source: Jack Xu, Excite@Home, 1999

Page 39: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

“Structured Queries”

• Weight of term a in a document i depends on:
  – TF(a,i): Frequency of term a in document i
  – DF(a): How many documents term a occurs in
• Build pseudo-terms from alternate translations
  – TF(syn(a,b),i) = TF(a,i) + TF(b,i)
  – DF(syn(a,b)) = |{docs with a} ∪ {docs with b}|
• Downweight terms with any common translation
  – Particularly effective for long queries
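The pseudo-term statistics can be computed directly from a toy inverted index. In this sketch the index layout (term to `{doc_id: tf}`) and the function names `synonym_tf` and `synonym_df` are assumptions made for illustration; the German terms are borrowed from the "Swiss bank" example later in the deck, with made-up counts.

```python
def synonym_tf(index, translations, doc_id):
    """TF of the pseudo-term: sum of the TFs of all translations in one document."""
    return sum(index.get(term, {}).get(doc_id, 0) for term in translations)

def synonym_df(index, translations):
    """DF of the pseudo-term: size of the union of the translations' document sets."""
    docs = set()
    for term in translations:
        docs |= set(index.get(term, {}))
    return len(docs)

# Toy inverted index with made-up counts: term -> {doc_id: term frequency}.
INDEX = {
    "bank": {"d1": 2, "d2": 1},
    "bankgebäude": {"d2": 3},
    "ufer": {"d3": 1},
}

TRANSLATIONS = ["bank", "bankgebäude"]
```

Here `synonym_tf(INDEX, TRANSLATIONS, "d2")` is 4 and `synonym_df(INDEX, TRANSLATIONS)` is 2. Because the union DF grows whenever any translation is common, the pseudo-term's weight drops, which is the downweighting effect named in the last bullet.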

Page 40: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Computing Weights

(Query Terms: 1: 2: 3: )

• Unbalanced: $\frac{1}{3}\left[\frac{TF_1}{DF_1}+\frac{TF_2}{DF_2}+\frac{TF_3}{DF_3}\right]$
  – Overweights query terms that have many translations
• Balanced (#sum): $\frac{1}{2}\left[\frac{1}{2}\left(\frac{TF_1}{DF_1}+\frac{TF_2}{DF_2}\right)+\frac{TF_3}{DF_3}\right]$
  – Sensitive to rare translations
• Pirkola (#syn): $\frac{1}{2}\left[\frac{TF_1+TF_2}{DF_1+DF_2}+\frac{TF_3}{DF_3}\right]$
  – Deemphasizes query terms with any common translation
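A small numeric comparison of the three schemes, under two assumptions: terms 1 and 2 are alternate translations of one query term while term 3 translates a second, and the weight is the simplified TF/DF ratio used on the slide (with Pirkola's DF taken as the sum DF1 + DF2). The counts are made up.

```python
def unbalanced(groups):
    """Average TF/DF over every translation, ignoring which query term it came from."""
    pairs = [pair for group in groups for pair in group]
    return sum(tf / df for tf, df in pairs) / len(pairs)

def balanced(groups):
    """Average TF/DF within each query term's translations, then across query terms."""
    return sum(sum(tf / df for tf, df in g) / len(g) for g in groups) / len(groups)

def pirkola(groups):
    """Pool TF and DF over each query term's translations before taking the ratio."""
    return sum(sum(tf for tf, _ in g) / sum(df for _, df in g)
               for g in groups) / len(groups)

# Each inner list holds (TF, DF) pairs for the translations of one query term.
# The first query term has a rare translation (DF=10) and a common one (DF=1000).
GROUPS = [[(2, 10), (1, 1000)], [(1, 20)]]
```

On these numbers the unbalanced score is about 0.0837, the balanced score about 0.0753, and the Pirkola score about 0.0265. The common translation (DF=1000) pulls the Pirkola weight of the first query term down sharply, while the rare translation dominates the other two schemes.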

Page 41: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Measuring Coverage Effects

[Diagram: 33 English Queries (TD) → English/English Translation Lexicon → Ranked Retrieval over 113,000 CLEF English News Stories → Ranked List → Evaluation against CLEF Relevance Judgments → Measure of Effectiveness]

Page 42: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

35 Bilingual Term Lists

• Chinese (193, 111)

• German (103, 97, 89, 6)

• Hungarian (63)

• Japanese (54)

• Spanish (35, 21, 7)

• Russian (32)

• Italian (28, 13, 5)

• French (20, 17, 3)

• Esperanto (17)

• Swedish (10)

• Dutch (10)

• Norwegian (6)

• Portuguese (6)

• Greek (5)

• Afrikaans (4)

• Danish (4)

• Icelandic (3)

• Finnish (3)

• Latin (2)

• Welsh (1)

• Indonesian (1)

• Old English (1)

• Swahili (1)

• Eskimo (1)

Page 43: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Size Effect

String matching

Stem matching 7% OOV

Page 44: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Out-of-Vocabulary Distribution

Page 45: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Measuring Named Entity Effect

[Diagram. Indexing: English Documents → Compute Term Weights → Build Index. Search: English Query → Translation Lexicon (− Named Entities / + Named Entities) → Compute Term Weights → Compute Document Score → Sort Scores → Ranked List]

Page 46: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Named entities removed

Named entities from term list

Named entities added

Full Query

Page 47: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Hieroglyphic

Egyptian Demotic

Greek

Page 48: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Types of Bilingual Corpora

• Parallel corpora: translation-equivalent pairs
  – Document pairs
  – Sentence pairs
  – Term pairs
• Comparable corpora: topically related
  – Collection pairs
  – Document pairs

Page 49: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Exploiting Parallel Corpora

• Automatic acquisition of translation lexicons

• Statistical machine translation

• Corpus-guided translation selection

• Document-linked techniques

Page 50: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Some Modern Rosetta Stones

• News:
  – DE-News (German-English)
  – Hong-Kong News, Xinhua News (Chinese-English)
• Government:
  – Canadian Hansards (French-English)
  – Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish)
  – UN Treaties (Russian, English, Arabic, …)
• Religion
  – Bible, Koran, Book of Mormon

Page 51: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Parallel Corpus

• Example from DE-News (8/1/1996)

English: Diverging opinions about planned tax reform
German: Unterschiedliche Meinungen zur geplanten Steuerreform

English: The discussion around the envisaged major tax reform continues .
German: Die Diskussion um die vorgesehene grosse Steuerreform dauert an .

English: The FDP economics expert , Graf Lambsdorff , today came out in favor of advancing the enactment of significant parts of the overhaul , currently planned for 1999 .
German: Der FDP - Wirtschaftsexperte Graf Lambsdorff sprach sich heute dafuer aus , wesentliche Teile der fuer 1999 geplanten Reform vorzuziehen .

Page 52: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Word-Level Alignment

English: Diverging opinions about planned tax reform
German: Unterschiedliche Meinungen zur geplanten Steuerreform

English: Madam President , I had asked the administration …
Spanish: Señora Presidenta, había pedido a la administración del Parlamento …

Page 53: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

A Translation Model

• From word-aligned bilingual text, we induce a translation model $p(f_i \mid e)$, where $\sum_{f_i} p(f_i \mid e) = 1$
• Example:
  – p(探测 | survey) = 0.4
  – p(试探 | survey) = 0.3
  – p(测量 | survey) = 0.25
  – p(样品 | survey) = 0.05

Page 54: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Using Multiple Translations

• Weighted Structured Query Translation
  – Takes advantage of multiple translations and translation probabilities
• TF and DF of query term e are computed using TF and DF of its translations:

$$TF(e, D_k) = \sum_{f_i} p(f_i \mid e) \times TF(f_i, D_k)$$

$$DF(e) = \sum_{f_i} p(f_i \mid e) \times DF(f_i)$$
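As a worked example of the two sums, the sketch below uses the translation probabilities for "survey" from the previous slide. The per-document term frequencies and the collection document frequencies are made up, and the function names `weighted_tf` and `weighted_df` are hypothetical.

```python
def weighted_tf(translation_probs, tf_in_doc):
    """TF(e, D_k): probability-weighted sum of the translations' TFs in document D_k."""
    return sum(p * tf_in_doc.get(f, 0) for f, p in translation_probs.items())

def weighted_df(translation_probs, df):
    """DF(e): probability-weighted sum of the translations' document frequencies."""
    return sum(p * df.get(f, 0) for f, p in translation_probs.items())

# p(f_i | survey), as given in the translation model example.
P_SURVEY = {"探测": 0.4, "试探": 0.3, "测量": 0.25, "样品": 0.05}

# Made-up statistics for illustration.
TF_IN_DOC = {"探测": 2, "测量": 1}                           # term counts in one document
DF = {"探测": 100, "试探": 40, "测量": 200, "样品": 1000}    # collection document frequencies
```

Here TF(survey, D_k) = 0.4×2 + 0.25×1 = 1.05 and DF(survey) = 40 + 12 + 50 + 50 = 152. Unlike the unweighted pseudo-term approach, an unlikely translation such as 样品 contributes only 5% of its counts.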

Page 55: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Evaluating Corpus-Based Techniques

• Within-domain evaluation (upper bound)
  – Partition a bilingual corpus into training and test
  – Use the training part to tune the system
  – Generate relevance judgments for evaluation part
• Cross-domain evaluation (fair)
  – Use existing corpora and evaluation collections
  – No good metric for degree of domain shift

Page 56: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Ranked Retrieval Effectiveness: Title Queries

[Line chart: mean average precision (axis 0.00 to 0.20) against threshold (0.1 to 1.0) for Pirkola, Kwok, MDF, WTF, WDF, WTF/DF, and Baseline.]

English queries, Arabic documents

Page 57: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Exploiting Comparable Corpora

• Blind relevance feedback
  – Existing CLIR technique + collection-linked corpus
• Lexicon enrichment
  – Existing lexicon + collection-linked corpus
• Dual-space techniques
  – Document-linked corpus

Page 58: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Bilingual Query Expansion

[Diagram: source language query → Source Language IR (over the source language collection) → expanded source language query → Query Translation → Target Language IR (over the target language collection) → expanded target language terms → results. The first loop is pre-translation expansion; the second is post-translation expansion.]

Page 59: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Query Expansion Effect

[Line chart: mean average precision (axis 0.00 to 0.35) against unique Dutch terms (0 to 15,000) for four conditions: Both, Post, Pre, and None.]

Paul McNamee and James Mayfield, SIGIR-2002

Page 60: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Blind Relevance Feedback

• Augment a representation with related terms
  – Find related documents, extract distinguishing terms
• Multiple opportunities:
  – Before doc translation: Enrich the vocabulary
  – After doc translation: Mitigate translation errors
  – Before query translation: Improve the query
  – After query translation: Mitigate translation errors
• Short queries get the most dramatic improvement
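A minimal sketch of the feedback loop for the query side, assuming a toy in-memory collection, overlap-count ranking, and a simple "frequent in the top documents, rare in the collection" score for picking expansion terms. All names and documents are made up; real systems use stronger ranking and term-selection formulas.

```python
from collections import Counter
import math

def score(query_terms, doc_terms):
    """Toy ranking function: total occurrences of query terms in the document."""
    counts = Counter(doc_terms)
    return sum(counts[t] for t in set(query_terms))

def expand_query(query_terms, docs, top_docs=2, new_terms=1):
    """Add distinguishing terms from the top-ranked documents to the query."""
    ranked = sorted(docs, key=lambda d: score(query_terms, docs[d]), reverse=True)
    feedback = ranked[:top_docs]          # assumed relevant without user judgment
    df = Counter(t for terms in docs.values() for t in set(terms))
    candidates = Counter(t for d in feedback for t in docs[d] if t not in query_terms)

    def selectivity(term):
        # Frequent in the feedback documents, rare in the whole collection.
        return candidates[term] * math.log(len(docs) / df[term])

    best = sorted(candidates, key=lambda t: (-selectivity(t), t))[:new_terms]
    return list(query_terms) + best

DOCS = {
    "d1": "swiss bank secrecy law change".split(),
    "d2": "swiss bank director resigns secrecy".split(),
    "d3": "river bank flood warning".split(),
    "d4": "football match result".split(),
    "d5": "tax law reform".split(),
}
```

Calling `expand_query(["swiss", "bank"], DOCS)` adds `secrecy`, which appears in both top-ranked documents. The same routine could be run before translation (on a source-language collection) or after it (on the target-language collection).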

Page 61: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Indexing Time: Doc Translation

[Chart: indexing time (sec) against thousands of documents, monolingual vs. cross-language; tick labels run from 0 to 500.]

Page 62: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Post-Translation “Document Expansion”

[Diagram: Mandarin Chinese Documents → Automatic Segmentation → Term-to-Term Translation → Single Document → IR System over an English Corpus → Top 5 → Term Selection → Document to be Indexed → IR System ← English Query; output: Results]

Page 63: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Why Document Expansion Works

• Story-length objects provide useful context

• Ranked retrieval finds signal amid the noise

• Selective terms discriminate among documents
  – Enrich index with low DF terms from top documents
• Similar strategies work well in other applications
  – CLIR query translation
  – Monolingual spoken document retrieval

Page 64: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Lexicon Enrichment

… Cross-Language Evaluation Forum …

… Solto Extunifoc Tanixul Knadu …

?

Page 65: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Lexicon Enrichment

• Use a bilingual lexicon to align “context regions”
  – Regions with high coincidence of known translations
• Pair unknown terms with unmatched terms
  – Unknown: language A, not in the lexicon
  – Unmatched: language B, not covered by translation
• Treat the most surprising pairs as new translations

Page 66: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Cognate Matching

• Dictionary coverage is inherently limited
  – Translation of proper names
  – Translation of newly coined terms
  – Translation of unfamiliar technical terms
• Strategy: model derivational translation
  – Orthography-based
  – Pronunciation-based

Page 67: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Matching Orthographic Cognates

• Retain untranslatable words unchanged
  – Often works well between European languages
• Rule-based systems
  – Even off-the-shelf spelling correction can help!
• Character-level statistical MT
  – Trained using a set of representative cognates
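In the spirit of the spelling-correction remark, one simple way to match orthographic cognates is to look for the target-language word with the smallest normalized edit distance. The sketch below is an assumption-laden illustration: the threshold of 0.34, the vocabulary, and the name `best_cognate` are made up, and it is not the character-level MT approach from the last bullet.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming, row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def best_cognate(term, vocabulary, max_normalized_distance=0.34):
    """Return the closest vocabulary word, or None if nothing is close enough."""
    best, best_score = None, None
    for candidate in vocabulary:
        distance = edit_distance(term.lower(), candidate.lower())
        normalized = distance / max(len(term), len(candidate))
        if best_score is None or normalized < best_score:
            best, best_score = candidate, normalized
    if best_score is not None and best_score <= max_normalized_distance:
        return best
    return None

ENGLISH_VOCABULARY = ["discussion", "diversion", "economics"]
```

`best_cognate("diskussion", ENGLISH_VOCABULARY)` returns `discussion` (one substitution), while `best_cognate("meinungen", ENGLISH_VOCABULARY)` returns `None` because no candidate is orthographically close. The threshold trades missed cognates against false friends and would need tuning per language pair.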

Page 68: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Matching Phonetic Cognates

• Forward transliteration
  – Generate all potential transliterations
• Reverse transliteration
  – Guess source string(s) that produced a transliteration
• Match in phonetic space

Page 69: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Leveraging Cognates

[Diagram labels: Written Form, Written Form, String Comparison, Alphabetic Transliteration; Pronunciation, Pronunciation, Phonetic Transliteration; Spoken Form, Spoken Form, Phonetic Comparison; outputs: Similarity, Similarity]

Page 70: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Cross-Language “Retrieval”

[Diagram: Query → Query Translation → Translated Query → Search → Ranked List]

Page 71: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Interactive Translingual Search

[Diagram: Query Formulation → Query → Query Translation → Translated Query → Search → Ranked List → Selection → Document → Examination → Document → Use, with a Query Reformulation loop. Translation support: MT, Translated “Headlines”, English Definitions]

Page 72: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Selection

• Goal: Provide information to support decisions

• May not require very good translations
  – e.g., Term-by-term title translation
• People can “read past” some ambiguity
  – May help to display a few alternative translations

Page 73: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Language-Specific Selection

Query in English: Swiss bank (Search)

English results:
1 (0.72) Swiss Bankers Criticized, AP / June 14, 1997
2 (0.48) Bank Director Resigns, AP / July 24, 1997

German results, for the German query (Swiss)(Bankgebäude, bankverbindung, bank):
1 (0.91) U.S. Senator Warpathing, NZZ / June 14, 1997
2 (0.57) [Bankensecret] Law Change, SDA / August 22, 1997
3 (0.36) Banks Pressure Existent, NZZ / May 3, 1997

Page 74: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Translingual Selection

Query in English: Swiss bank (Search)

German Query: (Swiss)(Bankgebäude, bankverbindung, bank)

1 (0.91) U.S. Senator Warpathing, NZZ, June 14, 1997
2 (0.57) [Bankensecret] Law Change, SDA, August 22, 1997
3 (0.52) Swiss Bankers Criticized, AP, June 14, 1997
4 (0.36) Banks Pressure Existent, NZZ, May 3, 1997
5 (0.28) Bank Director Resigns, AP, July 24, 1997

Page 75: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Merging Ranked Lists

• Types of Evidence
  – Rank
  – Score
• Evidence Combination
  – Weighted round robin
  – Score combination
• Parameter tuning
  – Condition-based
  – Query-based

Example (rank, document, score):

List 1: 1 voa4062 .22; 2 voa3052 .21; 3 voa4091 .17; … 1000 voa4221 .04
List 2: 1 voa4062 .52; 2 voa2156 .37; 3 voa3052 .31; … 1000 voa2159 .02
Merged: 1 voa4062; 2 voa3052; 3 voa2156; … 1000 voa4201
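Both combination strategies can be sketched briefly. The code below is an illustration under assumptions: score combination is taken to be a weighted sum of raw scores (with a missing document contributing zero), round robin takes a fixed positive number of new documents from each list per round, and only the top three entries of each example list are used.

```python
def combine_scores(lists, weights=None):
    """Merge ranked lists by summing each document's (weighted) scores."""
    weights = weights or [1.0] * len(lists)
    totals = {}
    for ranked, weight in zip(lists, weights):
        for doc, doc_score in ranked:
            totals[doc] = totals.get(doc, 0.0) + weight * doc_score
    return sorted(totals, key=lambda doc: -totals[doc])

def weighted_round_robin(lists, weights):
    """Interleave lists by rank, taking weights[i] new documents from list i per round.

    Weights are assumed to be positive integers.
    """
    iterators = [iter(ranked) for ranked in lists]
    merged, seen = [], set()
    consumed_any = True
    while consumed_any:
        consumed_any = False
        for iterator, weight in zip(iterators, weights):
            taken = 0
            for doc, _ in iterator:
                consumed_any = True
                if doc not in seen:          # skip documents already merged
                    seen.add(doc)
                    merged.append(doc)
                    taken += 1
                    if taken == weight:
                        break
    return merged

# Top of the two example lists: (document, score).
LIST_1 = [("voa4062", 0.22), ("voa3052", 0.21), ("voa4091", 0.17)]
LIST_2 = [("voa4062", 0.52), ("voa2156", 0.37), ("voa3052", 0.31)]
```

Summing raw scores gives voa4062, voa3052, voa2156, voa4091, which agrees with the top three of the merged example, while an unweighted round robin places voa2156 ahead of voa3052. Raw scores from different retrieval runs are often not comparable, which is one reason the weights are a tuning parameter.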

Page 76: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Examination Interface

• Two goals
  – Refine document delivery decisions
  – Support vocabulary discovery for query refinement
• Rapid translation is essential
  – Document translation retrieval strategies are a good fit
  – Focused on-the-fly translation may be a viable alternative

Page 77: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Uh oh…

Page 78: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Translation for Assessment

Indonesian City of Bali in October last year in the bomb blast in the case of imam accused India of the sea on Monday began to be averted. The attack on getting and its plan to make the charges and decide if it were found guilty, he death sentence of May. Indonesia of the police said that the imam sea bomb blasts in his hand claim to be accepted. A night Club and time in the bomb blast in more than 200 people were killed and several injured were in which most foreign nationals. …

Page 79: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

MT in a Month

[Bar chart: cased NIST r3n4 score as a percent of human performance (axis 50 to 90) for best competing, ISI public, ISI public+, ISI unrestricted, ISI late, Human 6, and Human 5.]

Page 81: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Experiment Design

Participant | Task Order
1 | Topic11, Topic17 | Topic13, Topic29
2 | Topic11, Topic17 | Topic13, Topic29
3 | Topic17, Topic11 | Topic29, Topic13
4 | Topic17, Topic11 | Topic29, Topic13

Topic Key: Narrow: 11, 13; Broad: 17, 29

System Key: System A, System B

Page 82: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Maryland Experiments

• MT is almost always better
  – Significant overall and for narrow topics alone (one-tailed t-test, p<0.05)
• F measure is less insightful for narrow topics
  – Always near 0 or 1

[Bar chart: average F_0.8 on two topics (axis 0 to 1.2) for participants umd01 to umd04, comparing MT and GLOSS, shown separately for broad topics and narrow topics.]

Page 83: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

iCLEF 2002 Evaluation

[Bar chart: F-measure (axis 0 to 0.9) for topics 1 to 4 and their average, comparing Automatic and User-Assisted.]

English Queries, German Documents, 20 minutes/topic

Page 84: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Better Mental Process Models

[Bar chart: number of queries. iCLEF 2003, 10 minute sessions, each bar averages 4 searchers.]

Page 85: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Delivery

• Use may require high-quality translation
  – Machine translation quality is often rough
• Route to best translator based on:
  – Acceptable delay
  – Required quality (language and technical skills)
  – Cost

Page 86: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Where Things Stand

• Ranked retrieval works well across languages
  – Bonus: easily extended to text classification
  – Caveat: mostly demonstrated on news stories
• Machine translation is okay for niche markets
  – Keep an eye on this: accuracy is improving fast
• Building explainable systems seems possible

Page 87: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Recap: Finding What You Can’t Read

• Three key challenges
  – Segmentation, coverage, evidence combination
• Segmentation objectives differ
  – Translation: Favor precision over coverage
  – Retrieval: Balance precision and recall
• Multiple coverage enhancement techniques
  – Expansion, backoff translation, cognate matching
• Translating evidence beats translating weights

Page 88: Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.

Research Opportunities

[Chart comparing Perceived Opportunities with Past Investment for: User Interaction, Translation Selection, Transliteration, Comparable Corpora, Parallel Corpora, Dictionaries, Term Selection, Segmentation & Phrase Indexing, and Lexical Coverage.]