Péter Schönhofen – Ad Hoc Hungarian → English – CLEF Workshop 20 Sep 2007

Performing Cross-Language Retrieval with Wikipedia

Participation report for Ad Hoc bilingual Hungarian → English

joint work with

András A. Benczúr, István Bíró, Károly Csalogány

Data Mining and Web Search Group

Computer and Automation Research Institute

Hungarian Academy of Sciences

Péter Schönhofen

Our Approach

• Term-by-term query translation by dictionaries

• Bigram language model helps select the most probable English translation

• Using Wikipedia to discard off-topic terms

• IR system: Hungarian Academy of Sciences Search Engine (http://search.sztaki.hu)
  - TF×IDF-based OR query, heavily weighted by the number of matched terms
  - also takes proximity and term location into account

• Only the query title is used; the description and narrative contribute to mapping the title to Wikipedia concepts
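A rough sketch of what a TF×IDF-based OR query weighted by the number of matched terms could look like. The function name `score_documents`, the IDF variant, and the `match_boost` exponent are assumptions made for illustration; the slides do not give the actual weighting, and proximity and term location are left out here.

```python
import math
from collections import Counter


def score_documents(query_terms, docs, match_boost=2.0):
    """Rank documents for an OR query with TF x IDF, boosted by matched terms.

    docs: dict mapping document id -> list of tokens.
    match_boost: assumed exponent applied to the number of distinct matched terms.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        matched = [t for t in set(query_terms) if tf[t] > 0]
        if not matched:
            continue  # OR semantics: at least one term must match
        tfidf = sum(tf[t] * math.log(1 + n_docs / doc_freq[t]) for t in matched)
        scores[doc_id] = tfidf * len(matched) ** match_boost
    return sorted(scores, key=scores.get, reverse=True)


# Toy documents, for illustration only.
docs = {
    "d1": "oncology cancer treatment study".split(),
    "d2": "cancer cancer cancer statistics".split(),
    "d3": "weather report".split(),
}
ranking = score_documents(["oncology", "cancer", "treatment"], docs)
```

With these toy documents, the one matching all three query terms is ranked above the one that repeats a single term, and the document with no matching term is not returned.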

Outline of the algorithm

• Preparations
  - construct a dictionary
  - generate concept network from Wikipedia
  - pre-process queries and documents

• Raw translation
  - disambiguation with bigram model

• Improve translation quality with Wikipedia
  - map terms to concept space
  - rank concepts
  - map concepts to words

Dictionary Construction

• Two sources of Hungarian-English term pairs:
  - on-line dictionary of the Institute (official + community-edited entries)
  - cross-language links present in Wikipedia

• Conflicting entries are resolved in the above order of preference (official, community, Wikipedia)

• 100,510 dictionary entries in total (however, a large portion are idioms)
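A minimal sketch of merging term pairs from several sources by priority. It assumes that "conflicting" means the same Hungarian term is translated by more than one source; the function name `merge_dictionaries`, the source labels, and the sample entries are hypothetical.

```python
# Hypothetical source labels; lower number = higher priority, following the slide's order.
PRIORITY = {"official": 0, "community": 1, "wikipedia": 2}


def merge_dictionaries(entries):
    """Merge (hungarian, english, source) triples.

    When several sources translate the same Hungarian term, keep only the
    translations from the highest-priority source that has the term.
    """
    best = {}  # hungarian term -> (priority, list of translations)
    for hungarian, english, source in entries:
        rank = PRIORITY[source]
        if hungarian not in best or rank < best[hungarian][0]:
            best[hungarian] = (rank, [english])
        elif rank == best[hungarian][0] and english not in best[hungarian][1]:
            best[hungarian][1].append(english)
    return {term: translations for term, (_, translations) in best.items()}


# Illustrative entries, not taken from the actual dictionary.
entries = [
    ("rák", "crab", "wikipedia"),
    ("rák", "cancer", "official"),
    ("kutatás", "research", "community"),
]
merged = merge_dictionaries(entries)
```

Here the official entry for "rák" overrides the Wikipedia-derived one, while "kutatás" keeps its only available translation.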

Raw translation

• Find Hungarian dictionary terms in queries
  - Hungarian terms may overlap

• Select best translations based on bigram model
  - a translation is better if it joins to other translations through bigrams with higher probability
  - Wikipedia model used, but any other large corpus suffices

[Figure: each Hungarian word of the query has several translation candidates; the candidates are scored by the bigram model and the one with the maximum score is selected.]
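The selection step can be sketched as follows. This is a simplification under stated assumptions: it scores whole combinations of candidates jointly rather than each candidate separately, the bigram table holds toy values instead of a model estimated from Wikipedia, and `best_translation` and `floor` are hypothetical names.

```python
from itertools import product


def best_translation(candidates, bigram_prob, floor=1e-6):
    """Pick one candidate per query word so adjacent choices form likely bigrams.

    candidates: list of candidate lists, one per Hungarian query word.
    bigram_prob: dict mapping (word1, word2) -> probability (toy values here).
    floor: assumed smoothing value for unseen bigrams.
    """
    best_combo, best_score = None, -1.0
    # Exhaustive search is acceptable for short query titles.
    for combo in product(*candidates):
        score = 1.0
        for left, right in zip(combo, combo[1:]):
            score *= bigram_prob.get((left, right), floor)
        if score > best_score:
            best_combo, best_score = combo, score
    return list(best_combo)


# Toy bigram probabilities; a real model would be estimated from a large corpus.
bigram_prob = {("cancer", "research"): 0.02, ("crab", "research"): 0.0001}
candidates = [["crab", "cancer"], ["research", "inquiry"]]
chosen = best_translation(candidates, bigram_prob)
```

In this toy setting "cancer research" wins because its bigram is the most probable of the four possible combinations.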

Role of Wikipedia

Concept network

• Regular Wikipedia articles represent concepts
  - article title is the concept name
  - links to other articles describe semantic relations
  - redirections are handled as additional concept names (sort of synonyms)

• Category assignments are ignored

• Wikipedia is in fact converted to an ontology
  - less formal than a proper ontology (e.g. WordNet)
  - only one type of relationship exists
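A small sketch of how such a concept network might be represented: article titles and redirects become names of concepts, and article links become a single untyped relation. The input format, the function name `build_concept_network`, and the sample data are assumptions; parsing a real Wikipedia snapshot is not shown.

```python
from collections import defaultdict


def build_concept_network(articles, redirects):
    """Build a simple concept graph.

    articles: dict mapping article title -> list of linked article titles.
    redirects: dict mapping redirect title -> target article title.
    Returns (names, related): names maps a surface name to the concepts it can
    denote; related maps a concept to the set of concepts it links to.
    """
    names = defaultdict(set)
    related = defaultdict(set)
    for title, links in articles.items():
        names[title.lower()].add(title)  # the article title names the concept
        for target in links:
            if target in articles:  # keep only links to known concepts
                related[title].add(target)
    for alias, target in redirects.items():
        if target in articles:
            names[alias.lower()].add(target)  # a redirect acts as a synonym
    return dict(names), dict(related)


# Toy data, not taken from a real Wikipedia snapshot.
articles = {
    "Oncology": ["Cancer", "Medicine"],
    "Cancer": ["Oncology"],
}
redirects = {"Cancer research": "Oncology"}
names, related = build_concept_network(articles, redirects)
```

Category information never enters the structure, and all edges are of the same kind, which mirrors the "only one type of relationship" point.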

Map terms to concepts

• Match Wikipedia article titles with query terms

• Concepts behind Wikipedia article titles:
  - the same title may represent multiple concepts
  - another layer of disambiguation is introduced

• Concepts are recognized through terms and are carried by the text locations those terms occupy
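The matching step can be illustrated with a short sketch that scans a tokenized text for known titles and records the text locations carrying each candidate concept. The name table, the `max_len` limit, and the function name `map_terms_to_concepts` are assumptions made for this example.

```python
from collections import defaultdict


def map_terms_to_concepts(words, names, max_len=3):
    """Find title matches in a word list and record where each concept occurs.

    words: tokenized, lower-cased text.
    names: dict mapping a lower-cased title -> set of candidate concepts.
    max_len: assumed upper bound on title length in words.
    Returns a dict mapping concept -> set of (start, end) text locations.
    """
    locations = defaultdict(set)
    for start in range(len(words)):
        for end in range(start + 1, min(start + max_len, len(words)) + 1):
            phrase = " ".join(words[start:end])
            # An ambiguous title contributes every concept it may denote.
            for concept in names.get(phrase, ()):
                locations[concept].add((start, end))
    return dict(locations)


# Toy name table; "mercury" is deliberately ambiguous.
names = {
    "mercury": {"Mercury (planet)", "Mercury (element)"},
    "solar system": {"Solar System"},
}
words = "mercury in the solar system".split()
locations = map_terms_to_concepts(words, names)
```

Both readings of "mercury" end up attached to the same text location, which is exactly the ambiguity the ranking step then has to resolve.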

Rank concepts

• Select concepts which are the most tightly connected to other candidate concepts

• Score of concept C is computed from three factors:
  - L: number of text locations carrying concepts semantically related to C
  - M: number of concepts carried by the same text locations as C
  - F: number of text locations carrying C

$$S_C = L_C \times \frac{1}{1 + M_C} \times F_C$$
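A sketch of the scoring, assuming the score is the product of L, 1/(1 + M) and F, where L counts text locations carrying related concepts, M counts other concepts sharing a location with C, and F counts the locations of C itself. The data layout and the function name `concept_score` are hypothetical.

```python
def concept_score(concept, locations, related):
    """Score one concept as L * 1 / (1 + M) * F (assumed form of the formula).

    locations: dict mapping concept -> set of text locations carrying it.
    related: dict mapping concept -> set of semantically related concepts.
    """
    own = locations.get(concept, set())
    neighbours = related.get(concept, set())
    # L: text locations carrying at least one concept related to `concept`.
    l_count = len(set().union(*(locations.get(n, set()) for n in neighbours)))
    # M: other concepts sharing at least one text location with `concept`.
    m_count = sum(
        1 for other, locs in locations.items() if other != concept and locs & own
    )
    # F: text locations carrying `concept` itself.
    f_count = len(own)
    return l_count * (1.0 / (1 + m_count)) * f_count


# Toy data for illustration only.
locations = {
    "Mercury (planet)": {(0, 1)},
    "Mercury (element)": {(0, 1)},
    "Solar System": {(3, 5)},
}
related = {
    "Mercury (planet)": {"Solar System"},
    "Solar System": {"Mercury (planet)"},
}
scores = {c: concept_score(c, locations, related) for c in locations}
```

In this toy case the planet reading outscores the element reading because only the former is linked to another concept found in the text.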

Map concepts to words

1. Concepts → titles (word sequences)
  - pasting titles together would yield overly long queries

2. Titles → set of words

3. Words are ranked based on the scores of the concepts behind them
  - the same word may represent many concepts

4. Query title words are required
  - if all translations of a title word are discarded, a translation is forcibly injected into the translated query
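The four steps can be sketched as below. Several details are assumptions not stated on the slide: a word shared by several concepts sums their scores, only the `top_k` best words are kept, and the first listed translation is the one re-injected. The function name `concepts_to_query` and the sample scores are hypothetical.

```python
from collections import defaultdict


def concepts_to_query(concept_scores, title_translations, top_k=3):
    """Turn ranked concepts into a bag of query words.

    concept_scores: dict mapping concept title -> score.
    title_translations: one list of candidate translations per query title word.
    top_k: assumed cut-off on the number of ranked words kept.
    """
    word_scores = defaultdict(float)
    for title, score in concept_scores.items():
        # Split titles into a set of words instead of pasting whole titles.
        for word in set(title.lower().split()):
            # Assumption: a word shared by several concepts sums their scores.
            word_scores[word] += score
    ranked = sorted(word_scores, key=lambda w: (-word_scores[w], w))
    query = ranked[:top_k]
    # Every query title word must stay represented in the translated query.
    for translations in title_translations:
        if translations and not any(t in query for t in translations):
            query.append(translations[0])
    return query


# Toy scores for illustration only.
concept_scores = {"Oncology": 3.0, "Cancer": 2.0, "Cancer treatment": 1.0}
title_translations = [["cancer"], ["research", "inquiry"]]
query = concepts_to_query(concept_scores, title_translations, top_k=2)
```

Here "research" is added back at the end because none of its translations survived the ranking cut-off.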

Why use Wikipedia?

• Advantages
  - freely available (snapshots are downloadable)
  - relatively high quality
  - wide range of subjects covered
  - rapidly growing, up to date

• Disadvantages
  - articles do not always link to other relevant articles
  - category assignments are not always consistent
  - basic verbs and nouns are not covered

Example query

• Original query title: “cancer research”

• Raw translation: “oncology”

• Improved translation: “oncology cancer treatment”

Evaluation

Difficulties

• Hungarian stemmer is not perfect
  - the language is complex
  - pronouns are not always recognized as such

• Dictionary is small

• In short: raw translation is of very low quality

• Retrieval is not performed on the concept level

• Context is not large enough to support the reliable selection of relevant Wikipedia concepts

Future work

• Performing German queries against English corpora

• Richer dictionary

• Improved mechanism
  - raw translation is used for retrieval
  - Wikipedia concept network is used for determining the relevance of documents in hit lists: query-document matching is carried out in the space of Wikipedia concepts

• Improved matching
  - POS information is also taken into account

Thank you for your attention