1
The Domain-Specific Track at CLEF 2007
Vivien Petras, Stefan Baerisch & Max Stempfhuber
GESIS Social Science Information Centre, Bonn, Germany
Budapest, September 19, 2007
2
Outline
• The Domain-Specific Task
• Collections & Controlled Vocabularies
• Topics
• Participants, Runs & Relevance Assessments
• Themes
• Summary & Outlook
3
The Domain-Specific Task
Cross-language information retrieval (CLIR) on structured scientific document collections:
• social science domain
• bibliographic metadata
• controlled vocabularies for subject description
Leverage bibliographic metadata & controlled vocabularies for:
• search
• translation
4
The Domain-Specific Task
Tasks:
• Monolingual against German, English or Russian
• Bilingual against German, English or Russian
• Multilingual against combined collection
5
Collections
Language | Name | Description | Coverage | Docs | Abstracts
German | GIRT-DE | German social science literature & projects | 1990-2000 | 151,319 | 96%
English | GIRT-EN | GIRT-DE translated | 1990-2000 | 151,319 | 17%
English | CSA-SA | Sociolog. Abstracts | 1994-1996 | 20,000 | 94%
Russian | ISISS | Inst. of Scientific Inf. for Soc. Sc. of the Ru. Acad. of Science | | 145,802 | 27%
6
Controlled Vocabularies
Measure | GIRT | CSA-SA | ISISS
Descriptors / doc | 10 | 6.4 | 3.9
Class. codes / doc | 2 | 1.3 | n/a
5 different subject-describing terminologies:
• Thesaurus for the Social Sciences (GIRT-DE, -EN)
• Thesaurus of Sociological Indexing Terms (CSA-SA)
• INION Thesaurus (ISISS)
• Social Sciences Classification (GIRT-DE, -EN)
• Sociological Abstracts Classification (CSA-SA)
7
Controlled Vocabularies – Mapping Tools
Translation:
• GIRT German – GIRT English
Intellectual term mappings (cross-walks):
• equivalent terms in vocabularies
• GIRT German – CSA-SA English
• GIRT English – CSA-SA English
Example:
• original-term: agricultural area
• mapped-term: Rural areas
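A cross-walk of this kind can be used to rewrite query terms from one vocabulary into another. The following is a minimal sketch, assuming the mapping is available as a simple term-to-term table; the `CROSSWALK` dictionary and the `map_terms` helper are hypothetical names for illustration, and only the single entry shown on the slide is real data.

```python
# Hypothetical in-memory cross-walk: source-vocabulary term -> target-vocabulary term.
# Only this one pair appears on the slide; the table layout is an assumption.
CROSSWALK = {
    "agricultural area": "Rural areas",
}


def map_terms(terms, crosswalk):
    """Replace each term that has a mapping; keep unmapped terms unchanged."""
    # Match case-insensitively, since the two vocabularies capitalise differently.
    lookup = {source.lower(): target for source, target in crosswalk.items()}
    return [lookup.get(term.lower(), term) for term in terms]


mapped = map_terms(["Agricultural area", "migration"], CROSSWALK)
print(mapped)  # ['Rural areas', 'migration']
```

Unmapped terms pass through untouched here; a real system might instead drop them or fall back to dictionary or machine translation.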
8
Topics
25 topics in standard TREC format (title, desc, narr):
• 15 volunteers (social scientists)
• 2-5 suggestions from 28 subject specialties
• checked for:
  • coverage in collections
  • variance from previous years
• translated into English, Russian
9
Participants
5 groups
Group | Institution | Country
Chemnitz | Media Informatics, Chemnitz University of Technology | Germany
Cheshire | School of Information, UC Berkeley | USA
Moscow | Moscow State University | Russia
Unine | Computer Science Department, University of Neuchatel | Switzerland
Xerox | Data Mining Group, Xerox Research Centre Europe | France
10
Runs
Task | Runs 2007 | Runs 2006 | Runs 2005
Monolingual against German | 13 | 13 | 17
Monolingual against English | 15 | 8 | 15
Monolingual against Russian | 11 | 1 | 8
Bilingual against German | 14 | 6 | 15
Bilingual against English | 15 | 3 | 13
Bilingual against Russian | 9 | 3 | 5
Multilingual | 9 | 2 | 3
Total | 86 | 36 | 76
11
Relevance Assessments
Measure | German | English | Russian
Pool size | 16,288 | 17,867 | 14,473
Rel. docs 2007 | 22% | 25% | 10%*
Rel. docs 2006 | 39% | 26% | n/a
Rel. docs 2005 | 20% | 21% | 9% (RSSC)
* In Russian collection: 3 topics without relevant documents
All assessments done with Univ. of Padova's DIRECT System.
12
Relevance Assessments – Best MAP
Task | MAP 2007 | MAP 2006 | MAP 2005
Monolingual against German | 0.5051 | 0.5454 | 0.4936
Monolingual against English | 0.3534 | 0.4576 | 0.5065
Monolingual against Russian | 0.1971 | 0.2542 | 0.3038
Bilingual against German | 0.4568 (90%) | 0.2448 (45%) | 0.4201 (85%)
Bilingual against English | 0.3341 (95%) | 0.3301 (72%) | 0.4743 (94%)
Bilingual against Russian | 0.1348 (68%) | 0.1648 (62%) | 0.2331 (77%)
Multilingual | 0.0884 | 0.0753 | 0.0532
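The percentages next to the bilingual scores appear to express the best bilingual MAP as a share of the best monolingual MAP for the same target language; the slide does not say so explicitly, so this reading is an assumption. A minimal sketch of that arithmetic for the 2007 column (the helper name `percent_of_monolingual` is hypothetical):

```python
def percent_of_monolingual(bilingual_map, monolingual_map):
    """Bilingual MAP as a rounded percentage of the monolingual MAP."""
    return round(100 * bilingual_map / monolingual_map)


# Best 2007 MAP per target language, copied from the table: (monolingual, bilingual).
BEST_MAP_2007 = {
    "German": (0.5051, 0.4568),
    "English": (0.3534, 0.3341),
    "Russian": (0.1971, 0.1348),
}

ratios = {
    language: percent_of_monolingual(bilingual, monolingual)
    for language, (monolingual, bilingual) in BEST_MAP_2007.items()
}
print(ratios)  # {'German': 90, 'English': 95, 'Russian': 68}
```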
13
Themes – Retrieval Models
• Lucene
• Language Modelling
• Logistic Regression
• Comparison: Vector Space, LM, Probabilistic - Okapi, DFR
• Data fusion
• Russian
  • word-based vs. N-gram retrieval
  • new light-weight stemmer
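To illustrate the difference between word-based and N-gram indexing, the sketch below splits text into overlapping character n-grams. It is a generic illustration, not the method any participant used: the function name `char_ngrams`, the default `n=4`, and the choice to keep short words whole are all assumptions.

```python
def char_ngrams(text, n=4):
    """Split text into overlapping character n-grams, word by word."""
    grams = []
    for word in text.lower().split():
        if len(word) <= n:
            # Assumption: words no longer than n are indexed as they are.
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


print(char_ngrams("social policy"))
# ['soci', 'ocia', 'cial', 'poli', 'olic', 'licy']
```

Because inflected forms of a word share most of their n-grams, this kind of indexing can stand in for stemming in morphologically rich languages such as Russian.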
14
Themes – Query Expansion
Entry Vocabulary Modules
• query terms associated with thesaurus terms from documents
Thesaurus Lookup
• combined thesaurus from all CVs
• GIRT Thesaurus Index
Lexical Entailment
• find document terms in relation to query terms
Blind Feedback
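Blind (pseudo-relevance) feedback expands a query with terms taken from the top-ranked documents of an initial search, without any human judgement. The following is a minimal sketch of the general idea using raw term counts; the function name `blind_feedback`, its parameters, and the toy documents are hypothetical, and the participating systems used their own term-weighting schemes.

```python
from collections import Counter


def blind_feedback(query_terms, ranked_docs, top_docs=2, new_terms=2):
    """Expand a query with the most frequent new terms from the top-ranked documents.

    ranked_docs is a list of tokenised documents, best match first.
    """
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        # Count only terms that are not already in the query.
        counts.update(term for term in doc if term not in query_terms)
    expansion = [term for term, _ in counts.most_common(new_terms)]
    return list(query_terms) + expansion


# Toy initial ranking (illustrative data only).
ranking = [
    ["unemployment", "labour", "market", "youth"],
    ["labour", "market", "policy", "unemployment"],
    ["football", "league"],
]
expanded = blind_feedback(["unemployment"], ranking)
print(expanded)  # ['unemployment', 'labour', 'market']
```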
15
Themes – Translation
Lucene plug-in
• Babelfish, Google, PROMT, Reverso
Bilingual thesaurus mapping
Dictionary adaptation
• disambiguate term translation given language context of feedback documents
Statistical machine translation
• MATRAX
Commercial Software
16
Summary & Outlook
Extension of Russian materials
• Translation table DE-EN-RU for GIRT Thesaurus
• Translation table RU-EN for INION Thesaurus
• Mapping between GIRT – INION Thesaurus
More tools for terminology mapping
• different relationships (0T, SYN, BT, NT, RT)
• GESIS-IZ project: > 40 mappings
  • 25 controlled vocabularies / 11 disciplines
  • ~ 125,000 terms & phrases
  • ~ 400,000 relations
17
Domain-Specific Track: http://www.gesis.org/en/research/information_technology/clef_ds_2007.htm
Vocabulary Mappings: http://www.gesis.org/en/research/information_technology/komohe.htm
Email: [email protected]