SINAI-GIR
A Multilingual Geographical IR System
University of Jaén (Spain)
José Manuel Perea Ortega
CLEF 2008, 18 September, Aarhus (Denmark)
Computer Science Department
Introduction
• Preliminary work of SINAI in GeoCLEF:
– 2006: query expansion using gazetteers and a thesaurus [García-Vega et al., 2007]
– 2007: filtering documents based on manual rules [Perea-Ortega et al., 2007]
• GeoCLEF 2008:
– Filtering documents using new manual rules and new approaches (query reformulation, keyword and hyponym extraction, query geo-expansion)
GeoCLEF 2008, Aarhus
SINAI-GIR System overview
[Figure: system architecture diagram. Labeled elements: Multilingual Query, Translator, English Query (Q), Query Analyzer, Q1, Q2, Q3, Keywords and geo-information extracted, GeoNames, English collection, Collection Preprocessing subsystem, IR Subsystem, Documents retrieved, Validator, Final Re-Ranked Documents retrieved.]
Translator:
• Translates the queries from other languages into English
• Uses SINTRAM (SINai TRAnslation Module) [García-Cumbreras et al., 2007], which works with different online machine translators
Collection Preprocessing subsystem:
• Preprocessing: stemming, stopword removal, POS tagging
• The toponyms are extracted (NER)
• Two indexes are generated:
– Locations
– Keywords
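The two-index idea above can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the toy stopword list, the toy gazetteer standing in for NER plus GeoNames, and the crude `simple_stem` helper are not the tools the system actually uses.

```python
import re
from collections import defaultdict

# Toy resources, purely illustrative (assumed, not taken from the system).
STOPWORDS = {"the", "in", "of", "a", "and", "near", "to"}
TOY_GAZETTEER = {"madrid", "spain", "aarhus", "denmark"}

def simple_stem(token: str) -> str:
    """Very crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def build_indexes(docs: dict[str, str]):
    """Return (locations_index, keywords_index), each mapping term -> set of doc ids."""
    locations = defaultdict(set)
    keywords = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in TOY_GAZETTEER:
                locations[token].add(doc_id)  # toponyms go to the locations index
            elif token not in STOPWORDS:
                keywords[simple_stem(token)].add(doc_id)  # the rest become keywords
    return dict(locations), dict(keywords)

docs = {
    "d1": "Floods in Madrid and the north of Spain",
    "d2": "Conference meetings in Aarhus",
}
locations_index, keywords_index = build_indexes(docs)
```

Keeping toponyms out of the keyword index lets the geographical and thematic parts of a query be matched separately.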
Query Analyzer:
• Query preprocessing: stemming, stopword removal, removal of irrelevant information
• The toponyms are extracted (NER)
• Spatial relation finder based on manual rules
• Query reformulation based on POS tagging and the query parsing subtask
• Geo-expansion using a gazetteer
• Keyword/hyponym detection
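As a rough illustration of a rule-based spatial relation finder followed by geo-expansion, consider the sketch below. The rule table, the toy gazetteer (standing in for a GeoNames lookup) and the choice to expand only "in" relations are all assumptions; the slides do not list the actual rules.

```python
import re

# Hypothetical rule table: surface pattern -> spatial relation label.
SPATIAL_RULES = [
    (r"\bnorth of\b", "north_of"),
    (r"\bsouth of\b", "south_of"),
    (r"\bnear\b", "near"),
    (r"\bin\b", "in"),
]

# Toy gazetteer: toponym -> places it contains (assumed data).
TOY_GAZETTEER = {
    "scandinavia": ["denmark", "norway", "sweden"],
    "denmark": ["aarhus", "copenhagen"],
}

def find_spatial_relation(query: str):
    """Return (relation, toponym) for the first rule that matches, else None."""
    q = query.lower()
    for pattern, relation in SPATIAL_RULES:
        m = re.search(pattern + r"\s+(?:the\s+)?([a-z]+)", q)
        if m and m.group(1) in TOY_GAZETTEER:
            return relation, m.group(1)
    return None

def geo_expand(query: str) -> str:
    """Append contained places to the query, but only for 'in' relations."""
    found = find_spatial_relation(query)
    if found is None or found[0] != "in":
        return query
    return query + " " + " ".join(TOY_GAZETTEER[found[1]])

expanded = geo_expand("Wind power in Denmark")
```

Here `expanded` carries the original query plus the places the toy gazetteer lists inside Denmark, while a query such as "Fishing north of Scandinavia" is left unchanged because containment does not apply to that relation.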
IR Subsystem:
• Lemur as the indexing and search engine
• Okapi with pseudo-relevance feedback (PRF) as the weighting function
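For readers unfamiliar with Okapi weighting and PRF, the sketch below shows a textbook BM25 scorer and a very simple feedback step. It is not Lemur's implementation: the parameter values, the non-negative IDF variant and the "add the most frequent new terms" expansion rule are assumptions chosen for brevity.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every document in `docs` (id -> list of tokens) against the query."""
    n_docs = len(docs)
    avg_len = sum(len(toks) for toks in docs.values()) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in docs.values() for term in set(toks))
    scores = {}
    for doc_id, toks in docs.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores

def expand_with_prf(query_terms, docs, top_docs=1, extra_terms=2):
    """Assume the top-ranked documents are relevant and borrow their frequent terms."""
    scores = bm25_scores(query_terms, docs)
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_docs]
    counts = Counter(t for d in ranked for t in docs[d] if t not in query_terms)
    return list(query_terms) + [t for t, _ in counts.most_common(extra_terms)]

docs = {
    "d1": ["flood", "madrid", "river", "flood"],
    "d2": ["election", "madrid"],
    "d3": ["river", "pollution", "denmark"],
}
scores = bm25_scores(["flood"], docs)
expanded_query = expand_with_prf(["flood"], docs)
```

In this toy collection only `d1` contains the query term, so it is the sole feedback document and the expanded query picks up its other terms.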
Validator:
• Filters the list of documents retrieved by the IR subsystem, applying different manual rules and using the geographical data detected in the query
• Re-ranks the documents using predefined weights for each rule and the keywords/hyponyms detected in the query
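The filter-then-re-rank step might look roughly like the sketch below. The single filtering rule (drop documents that mention none of the query toponyms), the `RetrievedDoc` structure and the weight values are hypothetical; the slides only state that each manual rule has a predefined weight.

```python
from dataclasses import dataclass, field

@dataclass
class RetrievedDoc:
    doc_id: str
    ir_score: float                              # score from the IR subsystem
    toponyms: set = field(default_factory=set)   # places found in the document
    terms: set = field(default_factory=set)      # keywords found in the document

# Hypothetical weights, one per rule.
RULE_WEIGHTS = {"toponym_match": 0.5, "keyword_match": 0.2}

def validate(docs, query_toponyms, query_keywords):
    """Drop documents with no query toponym, then re-rank the survivors."""
    kept = []
    for doc in docs:
        matched_places = doc.toponyms & query_toponyms
        if not matched_places:
            continue  # filtering rule: no geographical evidence for the query
        bonus = RULE_WEIGHTS["toponym_match"] * len(matched_places)
        bonus += RULE_WEIGHTS["keyword_match"] * len(doc.terms & query_keywords)
        kept.append((doc.ir_score + bonus, doc.doc_id))
    kept.sort(reverse=True)  # highest combined score first
    return [doc_id for _, doc_id in kept]

retrieved = [
    RetrievedDoc("d1", 2.0, {"madrid"}, {"flood"}),
    RetrievedDoc("d2", 2.4, {"paris"}, {"flood"}),
    RetrievedDoc("d3", 1.9, {"madrid", "spain"}, {"flood", "river"}),
]
reranked = validate(retrieved, {"madrid", "spain"}, {"flood", "river"})
```

Note that `d2` has the best IR score but is removed by the filter, which shows how a strict geographical rule can discard documents the retrieval engine ranked highly.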
Experiments description
• SINAI has participated in the monolingual and bilingual tasks with a total of 15 experiments:
– MONO-EN: 9 experiments
– BILI-X2EN: 6 experiments
• Combining the content of topic labels: TD or TDN
• Baseline: Q1 without applying any filtering or re-ranking process
• Other experiments:
– Filtering and re-ranking of the fused list of documents retrieved by Q1, Q2 and Q3
– Using keywords and/or hyponyms in the re-ranking process
MONO-EN results
Best result: baseline (no filtering and no re-ranking)
In some filtering experiments the use of keywords improves the results
Best results using only the TD topic labels
BILI-X2EN results
Best result: baseline (no filtering and no re-ranking) with Portuguese topics
Best results using only the TD topic labels
Conclusions
• The baseline experiment seems to work well because the geo-information is included in the retrieval process
• Filtering documents does not seem to work well: the geo-information is already included in the query, and we re-rank documents that may not be relevant with respect to their content
• Using keywords to re-rank the retrieved documents could be worthwhile, because in some experiments it improves on the results obtained without them
• Query reformulation could also be worthwhile, because for some topics it retrieves valid documents that the default query misses
TextMESS at GeoCLEF 2008
• Spanish TextMESS project (Intelligent, Interactive and Multilingual Text Mining based on Human Language Technologies): joint participation by the Polytechnic University of Valencia and the University of Jaén (SINAI)
• Method employed: merging algorithm based on a fuzzy Borda voting scheme, taking as input the two document lists returned by the two systems
• Second best result in the monolingual English task
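A fuzzy Borda merge of two result lists can be sketched as follows. This follows one common formulation of the fuzzy Borda count, in which each system's preference for document i over j is w_i / (w_i + w_j) and only preferences above 0.5 are summed. Whether the joint run used exactly this weighting is an assumption, as is treating a document missing from a list as having weight 0.

```python
from collections import defaultdict

def fuzzy_borda_merge(ranked_lists):
    """Merge several {doc_id: score} result lists with a fuzzy Borda count."""
    candidates = set().union(*ranked_lists)
    totals = defaultdict(float)
    for scores in ranked_lists:  # each system acts as one voter
        for i in candidates:
            w_i = scores.get(i, 0.0)  # assumed: missing document has weight 0
            for j in candidates:
                if i == j:
                    continue
                w_j = scores.get(j, 0.0)
                if w_i + w_j == 0:
                    continue  # neither document was retrieved by this system
                r_ij = w_i / (w_i + w_j)  # fuzzy preference of i over j
                if r_ij > 0.5:  # count only strict preferences
                    totals[i] += r_ij
    # Highest total first; ties broken by document id for determinism.
    return sorted(candidates, key=lambda d: (-totals[d], d))

system_a = {"d1": 0.9, "d2": 0.6, "d3": 0.3}
system_b = {"d2": 0.8, "d3": 0.7, "d4": 0.2}
merged = fuzzy_borda_merge([system_a, system_b])
```

Documents retrieved by both systems (`d2`, `d3`) accumulate votes from both voters, so they tend to rise above documents that only one system returned.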
Thank you
sinai.ujaen.es
• References
– García-Vega, Manuel; García-Cumbreras, Miguel A.; Ureña-López, L. Alfonso; Perea-Ortega, José M. GEOUJA System. The first participation of the University of Jaén at GEOCLEF 2006. In LNCS, volume 4730, pages 913-917. Springer-Verlag, 2007.
– Perea-Ortega, José M.; García-Cumbreras, Miguel A.; García-Vega, Manuel; Montejo-Ráez, Arturo. GEOUJA System. University of Jaén at GEOCLEF 2007. In Proceedings of the Cross Language Evaluation Forum (CLEF 2007), page 52, 2007.
– García-Cumbreras, Miguel A.; Ureña-López, L. Alfonso; Martínez-Santiago, Fernando; Perea-Ortega, José M. BRUJA System. The University of Jaén at the Spanish task of QA@CLEF 2006. In LNCS, volume 4730, pages 328-338. Springer-Verlag, 2007.