
The Domain-Specific Track at CLEF 2007

Vivien Petras, Stefan Baerisch & Max Stempfhuber
GESIS Social Science Information Centre, Bonn, Germany

Budapest, September 19, 2007


Outline

• The Domain-Specific Task
• Collections & Controlled Vocabularies
• Topics
• Participants, Runs & Relevance Assessments
• Themes
• Summary & Outlook


The Domain-Specific Task

CLIR on structured scientific document collections:
• social science domain
• bibliographic metadata
• controlled vocabularies for subject description

Leverage bibliographic metadata & controlled vocabularies for:
• search
• translation


The Domain-Specific Task

Tasks:
• Monolingual against German, English or Russian
• Bilingual against German, English or Russian
• Multilingual against combined collection


Collections

Name     Language  Docs     Abstracts  Coverage   Description
GIRT-DE  German    151,319  96%        1990-2000  German social science literature & projects
GIRT-EN  English   151,319  17%        1990-2000  GIRT-DE translated
CSA-SA   English   20,000   94%        1994-1996  Sociolog. Abstracts
ISISS    Russian   145,802  27%        -          Inst. of Scientific Inf. for Soc. Sc. of the Ru. Acad. of Science


Controlled Vocabularies

                    GIRT  CSA-SA  ISISS
Descriptors / doc   10    6.4     3.9
Class. codes / doc  2     1.3     n/a

5 different subject-describing terminologies:
• Thesaurus for the Social Sciences (GIRT-DE, -EN)
• Thesaurus of Sociological Indexing Terms (CSA-SA)
• INION Thesaurus (ISISS)
• Social Sciences Classification (GIRT-DE, -EN)
• Sociological Abstracts Classification (CSA-SA)


Controlled Vocabularies – Mapping Tools

Translation:
• GIRT German – GIRT English

Intellectual term mappings (cross-walks):
• equivalent terms in vocabularies
• GIRT German – CSA-SA English
• GIRT English – CSA-SA English

Example:
original-term: agricultural area
mapped-term: Rural areas
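A cross-walk of this kind can be applied to a query as a simple term lookup. The following is a minimal sketch, not the track's actual tooling: the `CROSSWALK` dictionary and the `map_terms` function are hypothetical names, and the only mapping entry is the example pair shown on the slide.

```python
# Hypothetical in-memory cross-walk; the single entry is the example pair
# from the slide (GIRT English term -> CSA-SA English term).
CROSSWALK = {
    "agricultural area": ["Rural areas"],
}


def map_terms(terms, crosswalk):
    """Replace each term by its mapped terms; keep unmapped terms unchanged."""
    mapped = []
    for term in terms:
        # Lookup is case-insensitive on the source side (an assumption).
        mapped.extend(crosswalk.get(term.lower(), [term]))
    return mapped


if __name__ == "__main__":
    print(map_terms(["Agricultural area", "migration"], CROSSWALK))
```

Terms without a mapping are passed through untouched here, so a query never loses terms; a real system might instead fall back to free-text translation.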


Topics

25 topics in standard TREC format (title, desc, narr):

• 15 volunteers (social scientists)
• 2-5 suggestions from 28 subject specialties
• checked for:
  • coverage in collections
  • variance from previous years
• translated into English, Russian
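Topics with title, desc and narr fields can be read with a small parser. The sketch below assumes a generic TREC-style layout with closed `<top>`, `<num>`, `<title>`, `<desc>` and `<narr>` tags; the exact tag names in the CLEF topic files may differ (for example, language-prefixed tags), and the sample topic text is invented purely for illustration.

```python
import re

# Invented sample topic in an assumed TREC-style layout (not a real CLEF topic).
SAMPLE_TOPICS = """<top>
<num> 001 </num>
<title> Example title </title>
<desc> Example description. </desc>
<narr> Example narrative. </narr>
</top>"""


def parse_topics(text):
    """Return one dict per <top> block with num, title, desc and narr fields."""
    topics = []
    for block in re.findall(r"<top>(.*?)</top>", text, flags=re.DOTALL):
        fields = {}
        for tag in ("num", "title", "desc", "narr"):
            match = re.search(rf"<{tag}>(.*?)</{tag}>", block, flags=re.DOTALL)
            # Missing fields become empty strings rather than raising.
            fields[tag] = match.group(1).strip() if match else ""
        topics.append(fields)
    return topics


if __name__ == "__main__":
    for topic in parse_topics(SAMPLE_TOPICS):
        print(topic["num"], topic["title"])
```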


Participants

5 groups

Group     Institution                                                   Country
Chemnitz  Media Informatics, Chemnitz University of Technology          Germany
Cheshire  School of Information, UC Berkeley                            USA
Moscow    Moscow State University                                       Russia
Unine     Computer Science Department, University of Neuchatel          Switzerland
Xerox     Data Mining Group, Xerox Research Centre Europe               France


Runs

Task                          Runs 2007  Runs 2006  Runs 2005
Monolingual against German    13         13         17
Monolingual against English   15         8          15
Monolingual against Russian   11         1          8
Bilingual against German      14         6          15
Bilingual against English     15         3          13
Bilingual against Russian     9          3          5
Multilingual                  9          2          3
Total                         86         36         76


Relevance Assessments

                German  English  Russian
Pool size       16,288  17,867   14,473
Rel. docs 2007  22%     25%      10%*
Rel. docs 2006  39%     26%      n/a
Rel. docs 2005  20%     21%      9% (RSSC)

* In Russian collection: 3 topics without relevant documents

All assessments done with Univ. of Padova's DIRECT system.


Relevance Assessments – Best MAP

Task                          MAP 2007      MAP 2006      MAP 2005
Monolingual against German    0.5051        0.5454        0.4936
Monolingual against English   0.3534        0.4576        0.5065
Monolingual against Russian   0.1971        0.2542        0.3038
Bilingual against German      0.4568 (90%)  0.2448 (45%)  0.4201 (85%)
Bilingual against English     0.3341 (95%)  0.3301 (72%)  0.4743 (94%)
Bilingual against Russian     0.1348 (68%)  0.1648 (62%)  0.2331 (77%)
Multilingual                  0.0884        0.0753        0.0532
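For readers unfamiliar with the measure, the sketch below shows how mean average precision (MAP) is commonly computed, and how the percentages in the bilingual rows can be reproduced. The reading that each percentage is the best bilingual MAP relative to the best monolingual MAP for the same target language is an assumption; it matches the 2007 figures. Function names and the toy document IDs are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant IDs."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0


def mean_average_precision(run, qrels):
    """Mean of average precision over all topics that have judgments."""
    scores = [average_precision(run.get(topic, []), relevant)
              for topic, relevant in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


def percent_of_monolingual(bilingual_map, monolingual_map):
    """Bilingual MAP as a rounded percentage of the monolingual baseline."""
    return round(100 * bilingual_map / monolingual_map)


if __name__ == "__main__":
    # Best 2007 values from the table: bilingual vs. monolingual, per target.
    for target, bi, mono in [("German", 0.4568, 0.5051),
                             ("English", 0.3341, 0.3534),
                             ("Russian", 0.1348, 0.1971)]:
        print(target, percent_of_monolingual(bi, mono))
```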


Themes – Retrieval Models

• Lucene
• Language Modelling
• Logistic Regression
• Comparison: Vector Space, LM, Probabilistic - Okapi, DFR
• Data fusion
• Russian
  • word-based vs. N-gram retrieval
  • new light-weight stemmer
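The difference between word-based and N-gram indexing can be illustrated with two tokenizers. This is a minimal sketch only: the slide does not state the N-gram length or how word boundaries were handled, so the default `n=4` and the per-word sliding window are assumptions, and the function names are hypothetical.

```python
def word_tokens(text):
    """Word-based indexing units: lowercased whitespace-separated words."""
    return text.lower().split()


def char_ngrams(text, n=4):
    """Overlapping character n-grams taken within each word.

    Words no longer than n are kept whole so short words still get indexed.
    """
    tokens = []
    for word in text.lower().split():
        if len(word) <= n:
            tokens.append(word)
        else:
            tokens.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return tokens


if __name__ == "__main__":
    print(word_tokens("Information Retrieval"))
    print(char_ngrams("retrieval"))
```

Because inflected forms of a word share most of their n-grams, this kind of indexing can partly stand in for stemming in morphologically rich languages such as Russian.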


Themes – Query Expansion

Entry Vocabulary Modules
• query terms associated with thesaurus terms from documents

Thesaurus Lookup
• combined thesaurus from all CVs
• GIRT Thesaurus Index

Lexical Entailment
• find document terms in relation to query terms

Blind Feedback
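Blind feedback expands a query with terms taken from the top-ranked documents of an initial search, without any relevance judgments. The sketch below uses raw term frequency to pick expansion terms; this is an assumption for illustration, as the participants' actual term-selection formulas are not given on the slide. The function name, parameters, and toy documents are hypothetical.

```python
from collections import Counter


def blind_feedback(query_terms, ranked_docs, top_docs=2, new_terms=2):
    """Add the most frequent non-query terms from the top-ranked documents.

    ranked_docs is a list of token lists, best-ranked document first.
    """
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        counts.update(term for term in doc if term not in query_terms)
    expansion = [term for term, _ in counts.most_common(new_terms)]
    return list(query_terms) + expansion


if __name__ == "__main__":
    docs = [
        ["migration", "labour", "policy", "labour"],
        ["labour", "market", "policy"],
        ["sports"],
    ]
    print(blind_feedback(["migration"], docs))
```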


Themes – Translation

Lucene plug-in
• Babelfish, Google, PROMT, Reverso

Bilingual thesaurus mapping

Dictionary adaptation
• disambiguate term translation given language context of feedback documents

Statistical machine translation
• MATRAX

Commercial software


Summary & Outlook

Extension of Russian materials
• Translation table DE-EN-RU for GIRT Thesaurus
• Translation table RU-EN for INION Thesaurus
• Mapping between GIRT – INION Thesaurus

More tools for Terminology mapping
• different relationships (0T, SYN, BT, NT, RT)
• GESIS-IZ project: > 40 mappings
  • 25 controlled vocabularies / 11 disciplines
  • ~ 125,000 terms & phrases
  • ~ 400,000 relations


Domain-Specific Track: http://www.gesis.org/en/research/information_technology/clef_ds_2007.htm

Vocabulary Mappings: http://www.gesis.org/en/research/information_technology/komohe.htm

Email: [email protected]