August 21, 2002 Szechenyi National Library
Support for Multilingual Information Access
Douglas W. Oard
College of Information Studies and
Institute for Advanced Computer Studies
University of Maryland, College Park, MD, USA
Multilingual Information Access
Help people find information that is expressed in any language
Outline
• User needs
• System design
• User studies
• Next steps
Global Languages
[Bar chart: speakers (millions, 0–800) of Chinese, English, Hindi-Urdu, Spanish, Portuguese, Bengali, Russian, Arabic, and Japanese]
Source: http://www.g11n.com/faq.html
Global Internet User Population
[Chart: English vs. Chinese share of Internet users, 2000 and 2005 (projected)]
Source: Global Reach
Global Internet Hosts
[Bar chart, log scale 0.1–100 million Internet hosts, by language (estimated by domain): English, Japanese, German, French, Dutch, Finnish, Spanish, Chinese, Swedish]
Source: Network Wizards Jan 99 Internet Domain Survey
European Web Size Projection
[Chart, log scale 0.1–10,000 billions of words: English vs. other European languages]
Source: Extrapolated from Grefenstette and Nioche, RIAO 2000
Global Internet Audio
source: www.real.com, Mar 2001
[Pie chart: over 2,500 Internet-accessible radio and television stations, split 1,062 / 1,438 between English and other languages]
Who needs Cross-Language Search?
• Searchers who can read several languages
  – Eliminate multiple queries
  – Query in most fluent language
• Monolingual searchers
  – If translations can be provided
  – If it suffices to know that a document exists
  – If text captions are used to search for images
Outline
• User needs
• System design
• User studies
• Next steps
Multilingual Information Access
[Concept map placing Multilingual Information Access among neighboring fields:]
• Information Science: Information Retrieval (Cross-Language Retrieval, Indexing Languages, Machine-Assisted Indexing), Digital Libraries (Multilingual Metadata), Information Use (International Information Flow, Diffusion of Innovation), Automatic Abstracting
• Natural Language Processing: Machine Translation, Information Extraction, Text Summarization
• Artificial Intelligence: Ontological Engineering (Multilingual Ontologies), Knowledge Discovery (Textual Data Mining), Machine Learning
• Human-Computer Interaction: Localization, Information Visualization
• World-Wide Web: Web Internationalization
• Other Fields: Topic Detection and Tracking, Speech Processing, Multilingual OCR, Document Image Understanding
Multilingual Information Access
[Diagram: Cross-Language Search (query translation, then select) and Cross-Language Browsing (document delivery with translation, then examine)]
The Search Process
[Diagram: an author infers concepts and selects document-language terms, producing the document; a monolingual searcher chooses document-language terms for the query, while a cross-language searcher chooses query-language terms; query-document matching connects query terms to document terms]
Interactive Search
[Diagram: Query Formulation → Query → Query Translation → Translated Query → Search → Ranked List → Selection → Document → Examination → Document → Use, with feedback to Query Reformulation]
Supporting tools: Synonym Selection, KeyWord In Context (KWIC)
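The query-translation step in the loop above can be sketched as simple dictionary-based lookup, with an optional synonym-selection pass by the searcher. The lexicon and word choices below are purely illustrative (not the actual iCLEF translation resources); this only shows the contrast between fully automatic translation (keep every known translation) and semi-automatic translation (the searcher picks one):

```python
# Hypothetical English-to-German term list; real systems used large
# bilingual dictionaries or MT lexicons.
EN_TO_DE = {
    "hunger": ["Hunger"],
    "strike": ["Streik", "Schlag", "Treffer"],
    "racism": ["Rassismus"],
}

def translate_query(terms, lexicon, chosen=None):
    """Dictionary-based query translation.

    Fully automatic mode keeps every known translation of each term;
    semi-automatic mode lets the searcher pick one (`chosen` maps a
    source term to its selected translation). Unknown terms pass
    through untranslated.
    """
    chosen = chosen or {}
    out = []
    for term in terms:
        if term in chosen:
            out.append(chosen[term])          # searcher-selected translation
        else:
            out.extend(lexicon.get(term, [term]))  # all candidates, or pass-through
    return out
```

With no selection, the ambiguous term "strike" contributes all three candidate translations to the target-language query; with a selection, only the intended sense survives.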
Outline
• User needs
• System design
• User studies
• Next steps
Cross-Language Evaluation Forum
• Annual European-language retrieval evaluation
  – Documents: 8 languages
    • Dutch, English, Finnish, French, German, Italian, Spanish, Swedish
  – Topics: 8 languages, plus Chinese and Japanese
  – Batch retrieval since 2000
• Interactive track (iCLEF) started in 2001
  – 2001 focus: document selection
  – 2002 focus: query formulation
iCLEF 2001 Experiment Design
[Latin-square design table: 4 participants per block; topic key – Narrow: Topics 11 and 13, Broad: Topics 17 and 29; topic order (e.g., Topic 11, Topic 17 vs. Topic 17, Topic 11) and system assignment (System A vs. System B) counterbalanced across participants]
144 trials, in blocks of 16, at 3 sites
An Experiment Session
• Task and system familiarization
• 4 searches (20 minutes each)
  – Read topic description
  – Examine document translations
  – Judge as many documents as possible
    • Relevant, Somewhat relevant, Not relevant, Unsure, Not judged
  – Instructed to seek high precision
• 8 questionnaires
  – Initial, each topic (4), each system (2), final
Measure of Effectiveness
• Unbalanced F-measure:
  – P = precision
  – R = recall
  – α = 0.8
  – F_α = 1 / (α/P + (1 − α)/R)
• Favors precision over recall
• This models an application in which:
  – Fluent translation is expensive
  – Missing some relevant documents would be okay
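As a sketch, the unbalanced F-measure can be computed directly from its definition; the function name is mine, and the α = 0.8 default follows the iCLEF setting:

```python
def unbalanced_f(precision, recall, alpha=0.8):
    """Van Rijsbergen's effectiveness measure:

        F_alpha = 1 / (alpha/P + (1 - alpha)/R)

    With alpha = 0.8, precision is weighted four times as heavily as
    recall, matching the iCLEF 2001 evaluation. Defined as 0 when
    either component is 0.
    """
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)
```

Note the asymmetry: a precision-heavy result (P = 0.8, R = 0.2) scores well above the mirror-image recall-heavy result (P = 0.2, R = 0.8), which is exactly the behavior the slide motivates.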
French Results Overview (CLEF AUTO)
[Official results figure]
English Results Overview (CLEF AUTO)
[Official results figure]
Commercial vs. Gloss Translation
• Commercial Machine Translation (MT) is almost always better
  – Significant with a one-tailed t-test (p < 0.05) over 16 trials
• Gloss translation usually beats random selection
[Bar chart: retrieval effectiveness (0–1.2) for searchers umd01–umd04 on broad and narrow topics, MT vs. gloss translation]
iCLEF 2002 Experiment Design
[Diagram: Topic Description → Query Formulation → Automatic Retrieval → Standard Ranked List → Interactive Selection; automatic retrieval is scored by Mean Average Precision, interactive selection by F with α = 0.8]
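Mean Average Precision, used here to score the automatic-retrieval stage, can be sketched directly from its definition (function names, document IDs, and topics below are illustrative):

```python
def average_precision(ranked, relevant):
    """AP for one topic: the mean of precision@k at every rank k that
    holds a relevant document, divided by the total number of relevant
    documents (so unretrieved relevant documents count against the run)."""
    hits = 0
    precision_sum = 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: average of per-topic AP over (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For a ranked list ["d1", "d2", "d3", "d4"] with relevant set {"d1", "d3"}, AP is (1/1 + 2/3) / 2 ≈ 0.83, rewarding systems that place relevant documents near the top.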
Maryland Experiments
• 48 trials (12 participants)
  – Half with automatic query translation
  – Half with semi-automatic query translation
• 4 subjects searched Der Spiegel and SDA
  – 20-60 relevant documents for 4 topics
• 8 subjects searched Der Spiegel
  – 8-20 relevant documents for 3 topics
    • 0 relevant documents for 1 topic!
Some Preliminary Results
• Average of 8 query iterations per search
• Relatively insensitive to topic
  – Topic 4 (Hunger Strikes): 6 iterations
  – Topic 2 (Treasure Hunting): 16 iterations
• Sometimes sensitive to system
  – Topics 1 and 2: system effect was small
  – Topics 3 and 4: fewer iterations with semi-automatic
• Topic 3: European Campaigns against Racism
Subjective Evaluation
• Semi-automatic system:
  – Ability to select translations – good
• Automatic system:
  – Simpler / less user involvement needed – good
  – Few functions / easier to learn and use – good
  – No control over translations – bad
• Both systems:
  – Highlighting keywords helps – good
  – Untranslated / poorly translated words – bad
  – No Boolean or proximity operators – bad
Outline
• User needs
• System design
• User studies
• Next steps
Next Steps
• Quantitative analysis from 2002 (MAP, F)
  – Iterative improvement of query quality
    • Utility of MAP as a measure of query quality?
    • Utility of semi-automatic translation
  – Accuracy of relevance judgments
• Search strategies– Dependence on system– Dependence on topic– Dependence on density of relevant documents
An Invitation
• Join CLEF
  – A first step: Hungarian topics
  – http://clef.iei.pi.cnr.it
• Join iCLEF
  – Help us focus on true user needs!
  – http://terral.lsi.uned.es/iCLEF