Péter Schönhofen – Ad Hoc Hungarian → English – CLEF Workshop 20 Sep 2007
Performing Cross-Language Retrieval with Wikipedia
Participation report for Ad Hoc bilingual Hungarian → English
joint work with
András A. Benczúr, István Bíró, Károly Csalogány
Data Mining and Web Search Group
Computer and Automation Research Institute
Hungarian Academy of Sciences
Péter Schönhofen
Our Approach
• Term-by-term query translation by dictionaries
• Bigram language model helps select the most probable English translation
• Using Wikipedia to discard off-topic terms
• IR system: Hungarian Academy of Sciences Search Engine (http://search.sztaki.hu)
    TF×IDF-based OR query, heavily weighted by the number of matched terms
    proximity and term location are also taken into account
• Only the query title is used; the description and narrative contribute to mapping the title to Wikipedia concepts
Outline of the algorithm
• Preparations
    construct a dictionary
    generate concept network from Wikipedia
    pre-process queries and documents
• Raw translation
    disambiguation with bigram model
• Improve translation quality with Wikipedia
    map terms to concept space
    rank concepts
    map concepts to words
Dictionary Construction
• Two sources of Hungarian-English term pairs:
    on-line dictionary of the Institute (official + community-edited entries)
    cross-language links present in Wikipedia
• Conflicting entries are resolved in the above order of priority (official, community, Wikipedia)
• 100,510 dictionary entries in total (however, a large portion are idioms)
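
The priority rule above can be sketched as a simple ordered merge. This is a minimal illustration, not the authors' implementation: the source names, the `merge_dictionaries` helper and the sample entries are all hypothetical, and it assumes a conflict means the same Hungarian term appears in more than one source.

```python
from typing import Dict, List, Tuple

# Hypothetical source names, listed from highest to lowest priority.
PRIORITY = ["official", "community", "wikipedia"]


def merge_dictionaries(
    sources: Dict[str, Dict[str, List[str]]]
) -> Dict[str, Tuple[str, List[str]]]:
    """Merge term -> translations maps, keeping the highest-priority entry."""
    merged: Dict[str, Tuple[str, List[str]]] = {}
    for name in PRIORITY:
        for term, translations in sources.get(name, {}).items():
            # setdefault leaves an entry from a higher-priority source untouched.
            merged.setdefault(term, (name, translations))
    return merged


# Invented sample entries, for illustration only.
sources = {
    "official": {"rák": ["cancer", "crab"]},
    "community": {"rák": ["crayfish"], "kutatás": ["research"]},
    "wikipedia": {"kutatás": ["inquiry"], "Budapest": ["Budapest"]},
}
merged = merge_dictionaries(sources)
```

Recording the winning source next to each entry makes it easy to check afterwards how much of the final dictionary each source contributed.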
Raw translation
• Find Hungarian dictionary terms in queries
    Hungarian terms may overlap
• Select best translations based on bigram model
    a translation is better if it joins to other translations through bigrams with higher probability
    Wikipedia model used, but any other large corpus suffices
[Figure: each Hungarian query word has several translation candidates (candidate 1, candidate 2); candidates are scored by the bigram model and the maximum is selected]
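
One way to read the bigram-based selection is as a search for the combination of candidates whose adjacent pairs are most probable. The sketch below is an assumption-laden toy: the probabilities in `BIGRAM`, the smoothing value `FLOOR`, and the exhaustive search over all combinations are invented for illustration, and it handles single-word translations only.

```python
import math
from itertools import product

# Toy bigram probabilities P(second | first); the values are invented.
BIGRAM = {
    ("cancer", "research"): 0.02,
    ("crab", "research"): 0.0001,
}
FLOOR = 1e-6  # assumed probability for unseen bigrams


def sequence_score(words):
    """Sum of log-probabilities of adjacent word pairs."""
    return sum(math.log(BIGRAM.get(pair, FLOOR)) for pair in zip(words, words[1:]))


def best_translation(candidates):
    """Pick one candidate per source word so that the bigram score is maximal."""
    return max(product(*candidates), key=sequence_score)


# One candidate list per Hungarian query word (invented example).
candidates = [["crab", "cancer"], ["research", "inquiry"]]
best = best_translation(candidates)
```

Exhaustive search is fine for short query titles; longer inputs would call for dynamic programming over adjacent positions instead.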
Role of Wikipedia
Concept network
• Regular Wikipedia articles represent concepts
    article title is concept name
    links to other articles describe semantic relations
    redirections are handled as additional concept names (sort of synonyms)
• Category assignments are ignored
• Wikipedia is in fact converted to an ontology
    less formal than a proper ontology (e.g. WordNet)
    only one type of relationship exists
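
A concept network of this kind can be held in two plain mappings: names to concepts, and concepts to related concepts. The sketch below is illustrative only; the `build_concept_network` helper, the input layout and the sample articles are hypothetical, and treating links as undirected is an assumption the slides do not confirm.

```python
from collections import defaultdict


def build_concept_network(articles, redirects):
    """articles: title -> list of linked titles; redirects: alias -> target title."""
    names = defaultdict(set)    # lowercased name -> concepts it may denote
    related = defaultdict(set)  # concept -> semantically related concepts
    for title, links in articles.items():
        names[title.lower()].add(title)
        for target in links:
            if target in articles:
                # Assumption: a link relates both concepts, whatever its direction.
                related[title].add(target)
                related[target].add(title)
    for alias, target in redirects.items():
        if target in articles:
            # A redirect acts as an additional name of the target concept.
            names[alias.lower()].add(target)
    return names, related


# Invented miniature snapshot, for illustration only.
articles = {
    "Oncology": ["Cancer"],
    "Cancer": ["Oncology", "Chemotherapy"],
    "Chemotherapy": [],
}
redirects = {"Cancer research": "Oncology"}
names, related = build_concept_network(articles, redirects)
```

Category pages are simply never passed in, which mirrors the decision to ignore category assignments.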
Map terms to concepts
• Match Wikipedia article titles with query terms
• Concepts behind Wikipedia article titles:
    the same title may represent multiple concepts
    another layer of disambiguation is introduced
• Concepts are recognized through terms, and are carried by text locations occupied by the term
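
The matching step can be pictured as looking up every short token span of the query among the known concept names. This is a hedged sketch: the `map_terms_to_concepts` helper, the `max_len` limit and the sample `names` table are hypothetical, and a text location is assumed to be a `(start, end)` token span.

```python
def map_terms_to_concepts(tokens, names, max_len=3):
    """Return {(start, end): concepts} for every token span matching a concept name."""
    carried = {}
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            phrase = " ".join(tokens[start:end]).lower()
            if phrase in names:
                # One name may denote several concepts; keep all of them for now.
                carried[(start, end)] = set(names[phrase])
    return carried


# Invented name table: one ambiguous title and one multi-word title.
names = {
    "cancer": {"Cancer", "Cancer (constellation)"},
    "cancer research": {"Oncology"},
}
tokens = ["cancer", "research", "funding"]
carried = map_terms_to_concepts(tokens, names)
```

Keeping every candidate concept per location defers disambiguation to the ranking step.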
Rank concepts
• Select concepts which are the most tightly connected to other candidate concepts
• Score of concept C computed from three factors:
    L: # text locations carrying concepts semantically related to C
    M: # concepts carried by the same text locations as C
    F: # text locations carrying C

    $S_C = L_C \times \frac{1}{1 + M_C} \times F_C$
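
The three factors can be computed directly from a location-to-concepts mapping. The sketch below assumes they combine as L × 1/(1 + M) × F, which is only one reading of the slide's formula; the `score_concepts` helper and the sample data are hypothetical.

```python
def score_concepts(carried, related):
    """carried: location -> set of concepts; related: concept -> related concepts."""
    concepts = set().union(*carried.values())
    scores = {}
    for c in concepts:
        own = [loc for loc, cs in carried.items() if c in cs]
        F = len(own)  # locations carrying c
        competitors = set()
        for loc in own:
            competitors |= carried[loc] - {c}
        M = len(competitors)  # other concepts sharing c's locations
        neighbours = related.get(c, set())
        L = sum(1 for cs in carried.values() if cs & neighbours)
        # Assumed combination of the three factors.
        scores[c] = L * (1 / (1 + M)) * F
    return scores


# Invented example: location 0 is ambiguous between two concepts.
carried = {
    0: {"Cancer", "Cancer (constellation)"},
    1: {"Oncology"},
    2: {"Chemotherapy"},
}
related = {
    "Cancer": {"Oncology", "Chemotherapy"},
    "Oncology": {"Cancer"},
    "Chemotherapy": {"Cancer"},
}
scores = score_concepts(carried, related)
ranking = sorted(scores, key=lambda c: (-scores[c], c))
```

In this toy case the off-topic sense of the ambiguous title has no related concepts in the query, so its score is zero and it falls to the bottom of the ranking.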
Map concepts to words
1. Concepts → titles (word sequences)
    pasting titles together would yield overly long queries
2. Titles → set of words
3. Words are ranked based on the scores of concepts behind them
the same word may represent many concepts
4. Query title words are required
    if all translations of a title word are discarded, one is forcefully injected into the translated query
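
The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: a word's score is assumed to be the sum of the scores of the concepts whose titles contain it, the `max_words` cut-off is invented, titles are split on whitespace only, and the injected word is assumed to be the first listed translation.

```python
def concepts_to_query(concept_scores, title_translations, max_words=2):
    """concept_scores: title -> score; title_translations: one candidate list per title word."""
    word_scores = {}
    for title, score in concept_scores.items():
        for word in title.lower().split():
            # Assumption: a word accumulates the scores of all concepts behind it.
            word_scores[word] = word_scores.get(word, 0.0) + score
    ranked = sorted(word_scores, key=lambda w: (-word_scores[w], w))
    query = ranked[:max_words]
    for alternatives in title_translations:
        if not any(word in query for word in alternatives):
            # Every translation of this title word was discarded: inject one.
            query.append(alternatives[0])
    return query


# Invented scores and translation candidates, for illustration only.
concept_scores = {"Oncology": 2.0, "Cancer": 1.5, "Cancer treatment": 1.0}
title_translations = [["cancer", "crab"], ["research", "inquiry"]]
query = concepts_to_query(concept_scores, title_translations)
```

Working on the set of words rather than on pasted titles keeps the query short even when many concepts share vocabulary.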
Why use Wikipedia?
• Advantages
    freely available (snapshots are downloadable)
    relatively high-quality
    wide range of subjects covered
    rapidly growing, up-to-date
• Disadvantages
    articles do not always link to other relevant articles
    category assignments are not always consistent
    basic verbs and nouns are not covered
Example query
• Original query title: “cancer research”
• Raw translation: “oncology”
• Improved translation: “oncology cancer treatment”
Evaluation
Difficulties
• Hungarian stemmer is not perfect
    language is complex
    pronouns are not always recognized as such
• Dictionary is small
• In short: raw translation is of very low quality
• Retrieval is not performed on the concept level
• Context is not large enough to support the reliable selection of relevant Wikipedia concepts
Future work
• Performing German queries against English corpora
• Richer dictionary
• Improved mechanism
    raw translation is used for retrieval
    Wikipedia concept network is used for determining relevance of documents in hit-lists: query-document matching carried out in the space of Wikipedia concepts
• Improved matching
    POS information also taken into account
Thank you for your attention