SINAI-GIR: A Multilingual Geographical IR System

José Manuel Perea Ortega

Computer Science Department, University of Jaén (Spain)

CLEF 2008, 18 September, Aarhus (Denmark)

Introduction

• Previous work of SINAI in GeoCLEF:
– 2006: query expansion using gazetteers and a thesaurus [García-Vega et al., 2007]
– 2007: filtering documents based on manual rules [Perea-Ortega et al., 2007]

• GeoCLEF 2008:
– Filtering documents using new manual rules and new approaches (query reformulation, keyword and hyponym extraction, query geo-expansion)

GeoCLEF 2008, Aarhus

SINAI-GIR System overview

[Figure: SINAI-GIR architecture diagram. Components: Translator, Query Analyzer, Collection Preprocessing subsystem, IR Subsystem, Validator. Data labels: Multilingual Query, English Query (Q), Q1, Q2, Q3, English collection, GeoNames, Keywords and geo-information extracted, Documents retrieved, Final Re-Ranked Documents retrieved.]

SINAI-GIR System overview: Translator

• Translates queries from other languages into English
• Uses SINTRAM (SINai TRAnslation Module) [García-Cumbreras et al., 2007]
• SINTRAM works with different online machine translators

SINAI-GIR System overview: Collection Preprocessing subsystem

• Preprocessing: stemming, stopword removal, POS tagging
• Toponyms are extracted (NER)
• Two indexes are generated:
– Locations
– Keywords
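The two-index idea above can be sketched as follows. This is a minimal illustration, not the SINAI-GIR implementation: the toy gazetteer lookup stands in for a real NER step, stemming and POS tagging are omitted, and the names `TOY_GAZETTEER`, `STOPWORDS` and `build_indexes` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy gazetteer standing in for a real NER step.
TOY_GAZETTEER = {"aarhus", "denmark", "spain", "jaén"}
# Hypothetical, very small stopword list.
STOPWORDS = {"the", "in", "of", "a", "and", "is"}


def build_indexes(docs):
    """Return (location_index, keyword_index), each mapping term -> set of doc ids."""
    locations = defaultdict(set)
    keywords = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            token = token.strip(".,;:()")
            if not token or token in STOPWORDS:
                continue
            # Toponyms go to the location index, everything else to the keyword index.
            if token in TOY_GAZETTEER:
                locations[token].add(doc_id)
            else:
                keywords[token].add(doc_id)
    return dict(locations), dict(keywords)


docs = {
    "d1": "Floods in Denmark",
    "d2": "The conference is in Aarhus, Denmark",
}
loc_idx, kw_idx = build_indexes(docs)
```

Keeping locations in a separate index lets later stages match geographical terms independently of thematic terms.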

SINAI-GIR System overview: Query Analyzer

• Query preprocessing: stemming, stopword removal, removal of irrelevant information
• Toponyms are extracted (NER)
• Spatial relations finder based on manual rules
• Query reformulation based on POS tagging and the query parsing subtask
• Geo-expansion using a gazetteer
• Keywords/hyponyms detection
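A rough sketch of two of the steps above, the rule-based spatial relation finder and gazetteer-based geo-expansion, might look like this. The single regular expression, the gazetteer fragment and the function name `analyze_query` are all assumptions made for illustration; the actual manual rules and the GeoNames lookup are not described in the slides.

```python
import re

# Hypothetical gazetteer fragment: place -> places it contains.
TOY_GAZETTEER = {
    "denmark": ["aarhus", "copenhagen", "odense"],
}

# Hypothetical manual rule: a spatial preposition followed by a capitalised toponym.
SPATIAL_RULE = re.compile(
    r"\b(in|near|north of|south of|east of|west of)\s+([A-Z][\w-]+)"
)


def analyze_query(query):
    """Split a query into thematic keywords, a spatial relation and a toponym."""
    match = SPATIAL_RULE.search(query)
    if match is None:
        return {"keywords": query.lower().split(), "relation": None,
                "toponym": None, "expansion": []}
    relation, toponym = match.group(1), match.group(2)
    # Everything outside the matched spatial expression is treated as thematic.
    thematic = (query[:match.start()] + query[match.end():]).lower().split()
    # Geo-expansion: add the places the gazetteer lists under the toponym.
    expansion = TOY_GAZETTEER.get(toponym.lower(), [])
    return {"keywords": thematic, "relation": relation,
            "toponym": toponym, "expansion": expansion}


result = analyze_query("Wind power in Denmark")
```

Here `result["expansion"]` holds the extra place names that an expanded query could include alongside the original toponym.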

SINAI-GIR System overview: IR Subsystem

• Lemur as the indexing and search engine
• Okapi with pseudo-relevance feedback (PRF) as the weighting function
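For readers unfamiliar with Okapi weighting, the sketch below scores documents with one common BM25 variant. It is only an illustration: it does not reproduce Lemur's implementation, it leaves out the PRF step, and the parameter values `k1=1.2` and `b=0.75` are conventional defaults rather than the settings used in these experiments.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score tokenised documents against query terms with a common BM25 variant."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1 and is normalised by document length via b.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


docs = {
    "d1": ["wind", "power", "denmark"],
    "d2": ["football", "spain"],
}
scores = bm25_scores(["wind", "denmark"], docs)
```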

SINAI-GIR System overview: Validator

• Filters the list of documents retrieved by the IR subsystem, applying different manual rules and using the geographical data detected in the query
• Re-ranks the documents using predefined weights for each rule and the keywords/hyponyms detected in the query
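The filter-then-re-rank behaviour described above can be sketched as below. The single filtering rule (drop documents that share no location with the query), the weight values and the function name `validate` are hypothetical; the slides do not list the actual rules or weights.

```python
def validate(retrieved, query_locations, query_keywords,
             location_weight=0.5, keyword_weight=0.1):
    """Filter and re-rank retrieved documents.

    retrieved: list of dicts with 'id', 'score', 'locations' (set), 'keywords' (set).
    """
    reranked = []
    for doc in retrieved:
        matched_locations = doc["locations"] & query_locations
        if not matched_locations:
            # Hypothetical rule: discard documents with none of the query locations.
            continue
        matched_keywords = doc["keywords"] & query_keywords
        # Boost the retrieval score by a predefined weight per matched item.
        boost = (1
                 + location_weight * len(matched_locations)
                 + keyword_weight * len(matched_keywords))
        reranked.append((doc["id"], doc["score"] * boost))
    return sorted(reranked, key=lambda pair: pair[1], reverse=True)


retrieved = [
    {"id": "d1", "score": 2.0, "locations": {"spain"}, "keywords": {"wind"}},
    {"id": "d2", "score": 1.5, "locations": {"denmark"}, "keywords": {"wind", "power"}},
    {"id": "d3", "score": 1.0, "locations": {"denmark", "aarhus"}, "keywords": set()},
]
final = validate(retrieved, {"denmark", "aarhus"}, {"wind", "power"})
```

In this toy run `d1` is filtered out despite having the highest retrieval score, which mirrors the risk noted in the conclusions: filtering acts on geographical evidence alone.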

Experiments description

• SINAI participated in the monolingual and bilingual tasks with a total of 15 experiments:
– MONO-EN: 9 experiments
– BILI-X2EN: 6 experiments

• Combining the content of topic labels: TD (title and description) or TDN (title, description and narrative)
• Baseline: Q1 without applying any filtering or re-ranking process
• Other experiments:
– Filtering and re-ranking of the fused list of documents retrieved by Q1, Q2 and Q3
– Using keywords and/or hyponyms in the re-ranking process
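The slides do not say how the lists retrieved by Q1, Q2 and Q3 are fused. As one possible illustration only, the sketch below keeps the highest score each document obtains in any list (a CombMAX-style fusion); the function name `fuse_lists` and the scores are hypothetical.

```python
def fuse_lists(*ranked_lists):
    """Fuse lists of (doc_id, score) pairs by keeping each document's best score.

    This CombMAX-style rule is an assumption, not the method used by SINAI-GIR.
    """
    best = {}
    for ranked in ranked_lists:
        for doc_id, score in ranked:
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


q1 = [("d1", 3.0), ("d2", 1.0)]
q2 = [("d2", 2.5), ("d3", 0.5)]
q3 = [("d4", 2.0)]
fused = fuse_lists(q1, q2, q3)
```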

MONO-EN results

• Best result: baseline (no filtering and no re-ranking)
• In some filtering experiments, the use of keywords improves the results
• Best results are obtained using only the TD topic labels

BILI-X2EN results

• Best result: baseline (no filtering and no re-ranking) with Portuguese topics
• Best results are obtained using only the TD topic labels

Conclusions

• The baseline experiment seems to work well because the geo-information is included in the retrieval process
• Filtering documents does not seem to work well: the geo-information is already included in the query, so we are re-ranking documents that may not be relevant with respect to their content
• Using keywords to re-rank the retrieved documents could be interesting, because in some experiments it improves on the results obtained without them
• Query reformulation could also be interesting, because for some topics it retrieves valid documents that the default query does not retrieve

TextMESS at GeoCLEF 2008

• Spanish TextMESS project (Intelligent, Interactive and Multilingual Text Mining based on Human Language Technologies): joint participation by the Polytechnic University of Valencia and the University of Jaén (SINAI)
• Method employed: a merging algorithm based on the fuzzy Borda voting scheme, taking as input the two document lists returned by the two systems
• Second-best result in the monolingual English task
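The merging step can be illustrated with one common formulation of fuzzy Borda voting: each system expresses a preference degree r_ij = w_i / (w_i + w_j) between documents i and j, and a document collects the degrees that exceed 0.5. The sketch assumes strictly positive scores and lets a document absent from a list receive no votes from that list; both choices, and the name `fuzzy_borda`, are assumptions rather than details taken from the slides.

```python
def fuzzy_borda(*ranked_lists):
    """Merge lists of (doc_id, score) pairs with a fuzzy Borda count.

    Scores are assumed to be strictly positive within each list.
    """
    totals = {}
    for ranked in ranked_lists:
        weights = dict(ranked)
        for doc_i, w_i in weights.items():
            count = 0.0
            for doc_j, w_j in weights.items():
                if doc_i == doc_j:
                    continue
                # Preference degree of doc_i over doc_j according to this system.
                preference = w_i / (w_i + w_j)
                if preference > 0.5:
                    count += preference
            # Each system ("voter") adds its count to the document's total.
            totals[doc_i] = totals.get(doc_i, 0.0) + count
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


system_a = [("d1", 0.9), ("d2", 0.6), ("d3", 0.3)]
system_b = [("d2", 0.8), ("d1", 0.4)]
merged = fuzzy_borda(system_a, system_b)
```

Unlike a classic Borda count, which uses only rank positions, this variant lets the size of the score gap between two documents influence the vote.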

Thank you

sinai.ujaen.es

References

– García-Vega, Manuel and García-Cumbreras, Miguel A. and Ureña-López, L. Alfonso and Perea-Ortega, José M. GEOUJA System. The first participation of the University of Jaén at GEOCLEF 2006. In LNCS, volume 4730, pages 913-917. Springer-Verlag, 2007.

– Perea-Ortega, José M. and García-Cumbreras, Miguel A. and García-Vega, Manuel and Montejo-Ráez, Arturo. GEOUJA System. University of Jaén at GEOCLEF 2007. In Proceedings of the Cross Language Evaluation Forum (CLEF 2007), page 52, 2007.

– García-Cumbreras, Miguel A. and Ureña-López, L. Alfonso and Martínez-Santiago, Fernando and Perea-Ortega, José M. BRUJA System. The University of Jaén at the Spanish task of QA@CLEF 2006. In LNCS, volume 4730, pages 328-338. Springer-Verlag, 2007.

http://sinai.ujaen.es