August 21, 2002 Szechenyi National Library Support for Multilingual Information Access Douglas W....
![Page 1: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/1.jpg)
August 21, 2002 Szechenyi National Library
Support for Multilingual Information Access
Douglas W. Oard
College of Information Studies and Institute for Advanced Computer Studies
University of Maryland, College Park, MD, USA
![Page 2: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/2.jpg)
Multilingual Information Access
Help people find information that is expressed in any language
![Page 3: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/3.jpg)
Outline
• User needs
• System design
• User studies
• Next steps
![Page 4: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/4.jpg)
Global Languages
[Bar chart: speakers (millions, axis 0 to 800) for Chinese, English, Hindi-Urdu, Spanish, Portuguese, Bengali, Russian, Arabic, Japanese]
Source: http://www.g11n.com/faq.html
![Page 5: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/5.jpg)
Global Internet User Population
[Figure: charts for 2000 and 2005; surviving labels: English, Chinese]
Source: Global Reach
![Page 6: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/6.jpg)
Global Internet Hosts
[Bar chart, log axis 0.1 to 100.0: Internet hosts (millions) by language (estimated by domain) for English, Japanese, German, French, Dutch, Finnish, Spanish, Chinese, Swedish]
Source: Network Wizards Jan 99 Internet Domain Survey
![Page 7: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/7.jpg)
European Web Size Projection
[Chart, log axis 0.1 to 10,000.0: billions of words; series: English, Other European]
Source: Extrapolated from Grefenstette and Nioche, RIAO 2000
![Page 8: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/8.jpg)
Global Internet Audio
Over 2500 Internet-accessible radio and television stations
[Pie chart: English and other languages; values 1062 and 1438]
Source: www.real.com, Mar 2001
![Page 9: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/9.jpg)
Who needs Cross-Language Search?
• Searchers who can read several languages
  – Eliminate multiple queries
  – Query in most fluent language
• Monolingual searchers
  – If translations can be provided
  – If it suffices to know that a document exists
  – If text captions are used to search for images
![Page 10: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/10.jpg)
Outline
• User needs
System design
• User studies
• Next steps
![Page 11: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/11.jpg)
Cross-Language Retrieval, Indexing Languages, Machine-Assisted Indexing
Information Retrieval
Multilingual Metadata
Digital Libraries
International Information Flow, Diffusion of Innovation
Information Use
Automatic Abstracting
Information Science
Machine Translation, Information Extraction, Text Summarization
Natural Language Processing
Multilingual Ontologies
Ontological Engineering
Textual Data Mining
Knowledge Discovery
Machine Learning
Artificial Intelligence
Localization, Information Visualization
Human-Computer Interaction
Web Internationalization
World-Wide Web
Topic Detection and Tracking
Speech Processing
Multilingual OCR
Document Image Understanding
Other Fields
Multilingual Information Access
![Page 12: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/12.jpg)
Cross-Language Search
Query
Translation
Document Delivery
Cross-Language Browsing
Select Examine
Multilingual Information Access
![Page 13: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/13.jpg)
The Search Process
Choose Document-Language Terms
Query-Document Matching
Infer Concepts
Select Document-Language Terms
Document
Author
Query
Choose Document-Language Terms
Monolingual Searcher
Choose Query-Language Terms
Cross-Language Searcher
![Page 14: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/14.jpg)
Interactive Search
Search
Translated Query
Selection
Ranked List
Examination
Document
Use
Document
Query Formulation
Query Translation
Query
Query Reformulation
![Page 15: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/15.jpg)
![Page 16: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/16.jpg)
Synonym Selection
![Page 17: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/17.jpg)
KeyWord In Context (KWIC)
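A KWIC display shows each occurrence of a query term together with a short window of the words around it, so a searcher can judge how the term is being used without reading the whole document. The following is only a minimal sketch of that idea, not the system shown in the talk: the function name `kwic`, the regex tokenizer, the bracket highlighting, and the sample sentence are all assumptions made for illustration.

```python
import re
from typing import List


def kwic(text: str, keyword: str, window: int = 3) -> List[str]:
    """Return one context line per occurrence of `keyword` in `text`."""
    # Simple word tokenizer (an assumption; real systems need
    # language-specific segmentation).
    tokens = re.findall(r"\w+", text)
    target = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            # Up to `window` words on each side of the match.
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Hypothetical sample text, for illustration only.
sample = "The library supports search across languages and search within documents"
for line in kwic(sample, "search", window=2):
    print(line)
```

Matching here is case-insensitive and exact; stemming or translation-aware matching would be needed before this could highlight query terms in translated documents.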
![Page 18: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/18.jpg)
![Page 19: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/19.jpg)
Outline
• User needs
• System design
User studies
• Next steps
![Page 20: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/20.jpg)
Cross-Language Evaluation Forum
• Annual European-language retrieval evaluation
  – Documents: 8 languages
    • Dutch, English, Finnish, French, German, Italian, Spanish, Swedish
  – Topics: 8 languages, plus Chinese and Japanese
  – Batch retrieval since 2000
• Interactive track (iCLEF) started in 2001
  – 2001 focus: document selection
  – 2002 focus: query formulation
![Page 21: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/21.jpg)
iCLEF 2001 Experiment Design
Task order by participant:
Participant 1: Topic11, Topic17 | Topic13, Topic29
Participant 2: Topic11, Topic17 | Topic13, Topic29
Participant 3: Topic17, Topic11 | Topic29, Topic13
Participant 4: Topic17, Topic11 | Topic29, Topic13
Topic Key: Narrow: 11, 13; Broad: 17, 29
System Key: System A, System B
144 trials, in blocks of 16, at 3 sites
![Page 22: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/22.jpg)
An Experiment Session
• Task and system familiarization
• 4 searches (20 minutes each)
  – Read topic description
  – Examine document translations
  – Judge as many documents as possible
    • Relevant, Somewhat relevant, Not relevant, Unsure, Not judged
    • Instructed to seek high precision
• 8 questionnaires
  – Initial, each topic (4), each system (2), final
![Page 23: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/23.jpg)
Measure of Effectiveness
• Unbalanced F-Measure:
$$F_\alpha = \frac{1}{\dfrac{\alpha}{P} + \dfrac{1-\alpha}{R}}$$
  – P = precision
  – R = recall
  – α = 0.8
• Favors precision over recall
• This models an application in which:
  – Fluent translation is expensive
  – Missing some relevant documents would be okay
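As a worked illustration of the unbalanced F-measure, the sketch below assumes the van Rijsbergen form F = 1 / (α/P + (1 − α)/R) with α = 0.8, so that precision carries four times the weight of recall. The function name `f_alpha` and the convention of returning 0 when precision or recall is 0 are assumptions for this example, not part of the slide.

```python
def f_alpha(precision: float, recall: float, alpha: float = 0.8) -> float:
    """Unbalanced F-measure: 1 / (alpha / P + (1 - alpha) / R)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    # Convention assumed here: the score is 0 if either component is 0,
    # which also avoids division by zero.
    if precision <= 0.0 or recall <= 0.0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


# With alpha = 0.8, high precision and modest recall scores better
# than the reverse.
print(f_alpha(1.0, 0.5))  # precision-heavy outcome
print(f_alpha(0.5, 1.0))  # recall-heavy outcome
```

When precision and recall are equal, the measure equals that common value for any α; the weighting only matters when the two differ.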
![Page 24: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/24.jpg)
French Results Overview
[Chart; surviving labels: CLEF, AUTO]
![Page 25: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/25.jpg)
English Results Overview
[Chart; surviving labels: CLEF, AUTO]
![Page 26: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/26.jpg)
Commercial vs. Gloss Translation
• Commercial Machine Translation (MT) is almost always better
  – Significant with one-tail t-test (p<0.05) over 16 trials
• Gloss translation usually beats random selection
[Bar chart: retrieval effectiveness (axis 0 to 1.2) by searcher (umd01 to umd04), MT vs. GLOSS, shown separately for broad topics and narrow topics]
![Page 27: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/27.jpg)
iCLEF 2002 Experiment Design
Query Formulation
Automatic Retrieval
Interactive Selection
Mean Average Precision
F0.8
Standard Ranked List
Topic Description
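The design above names mean average precision as one of its two measures. The sketch below assumes the standard uninterpolated definition (precision is averaged over the ranks at which relevant documents appear, then averaged across topics); the slide does not spell out the formula. The function names, the document and topic identifiers, and the choice to score a topic with no relevant documents as 0 are all hypothetical.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Uninterpolated average precision for one ranked list (no duplicate IDs)."""
    if not relevant:
        return 0.0  # assumed convention; AP is undefined without relevant documents
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Mean of per-topic average precision over all topics in `runs`."""
    scores = [average_precision(runs[t], qrels.get(t, set())) for t in runs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical ranked lists and relevance judgments.
runs = {"topicA": ["d1", "d2", "d3", "d4"], "topicB": ["d2", "d1"]}
qrels = {"topicA": {"d1", "d3"}, "topicB": {"d1"}}
print(mean_average_precision(runs, qrels))
```

Because each reformulated query yields a new ranked list, a score of this kind can be computed after every iteration to track how query quality changes during a search.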
![Page 28: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/28.jpg)
Maryland Experiments
• 48 trials (12 participants)
  – Half with automatic query translation
  – Half with semi-automatic query translation
• 4 subjects searched Der Spiegel and SDA
  – 20-60 relevant documents for 4 topics
• 8 subjects searched Der Spiegel
  – 8-20 relevant documents for 3 topics
    • 0 relevant documents for 1 topic!
![Page 29: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/29.jpg)
Some Preliminary Results
• Average of 8 query iterations per search
• Relatively insensitive to topic
  – Topic 4 (Hunger Strikes): 6 iterations
  – Topic 2 (Treasure Hunting): 16 iterations
• Sometimes sensitive to system
  – Topics 1 and 2: system effect was small
  – Topics 3 and 4: fewer iterations with semi-automatic
    • Topic 3: European Campaigns against Racism
![Page 30: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/30.jpg)
Subjective Evaluation
• Semi-automatic system:
  – Ability to select translations – good
• Automatic system:
  – Simpler / less user-involvement needed – good
  – Few functions / easier to learn and use – good
  – No control over translations – bad
• Both systems:
  – Highlighting keywords helps – good
  – Untranslated/poorly-translated words – bad
  – No Boolean or proximity operator – bad
![Page 31: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/31.jpg)
Outline
• User needs
• System design
• User studies
Next steps
![Page 32: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/32.jpg)
Next Steps
• Quantitative analysis from 2002 (MAP, F)
  – Iterative improvement of query quality
    • Utility of MAP as a measure of query quality?
    • Utility of semiautomatic translation
  – Accuracy of relevance judgments
• Search strategies
  – Dependence on system
  – Dependence on topic
  – Dependence on density of relevant documents
![Page 33: August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.](https://reader036.fdocuments.in/reader036/viewer/2022062321/56649e3f5503460f94b2f4c7/html5/thumbnails/33.jpg)
An Invitation
• Join CLEF
  – A first step: Hungarian topics
  – http://clef.iei.pi.cnr.it
• Join iCLEF
  – Help us focus on true user needs!
  – http://terral.lsi.uned.es/iCLEF