1
Terminology for information retrieval: effectiveness of cross-concordances
Philipp Mayr & Vivien Petras
GESIS Social Science Information Centre, Bonn, Germany
Knowledge organization on the Web
ISKO-IWA meeting
Naples, Italy, 5 September 2008
2
German Social Science Infrastructure Services
www.gesis.org
• Digital Library
• Data archive
• Consulting
• Surveys & studies
• Society observation
3
German Social Science Infrastructure Services
Document types:
• Bibliographic
• Full texts
• Project data
• Institutions
• Web pages
• Statistical data
• Surveys
• People
Disciplines:
• Sociology
• Political Science
• Education
• Psychology
• Economics
• Business Administration
4
Heterogeneous collections
• Many databases:
– document types / formats
– vocabularies
• Controlled vocabularies:
– internal consistency ⇧
– intersystem compatibility ⇩ (semantic heterogeneity)
• Solution: translate between vocabularies using cross-walks (terminology mapping)
5
KoMoHe Project
• September 2004 – 2007
• Goals:
– Models for searching heterogeneous collections
– Development, organization & management of cross-walks between controlled vocabularies
6
Terminology Mapping Initiatives
• OCLC Terminology Services
– DDC, LCC, LCSH, MeSH
• MACS (Multilingual Access to Subjects)
– LCSH – Rameau – SWD
• CARMEN
– SWD, TheSoz, STW, …
• Criss-Cross
– SWD – DDC
7
Cross-concordances
= manually created, directed relations between controlled terms of two knowledge organization systems (KOS)
[Diagram: KOS 1 "Computer" and KOS 2 "Information System"; KOS 1 "Database" and KOS 2 "Information System"]
[Diagram: KOS 1 100%; KOS 2 50% mapped; KOS 3 40%; KOS 4 60% mapped]
8
Relations
• Equivalence
• Narrower Term
• Broader Term
• Related Term
• Null: no mapping
KOS 1        Relation   KOS 2
Library      =          Bibliothèque
Thesaurus    <          KOS
Library      >          Special library
Hacker       ^          Computers + Security
Virus        0          (no mapping)
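The relation types and example pairs on this slide can be held in a small data structure. The following is a minimal sketch, not the project's actual data model: the names `Relation`, `Mapping` and `targets` are hypothetical, the symbols follow the slide, and the direction of each example pair (KOS 1 as source, KOS 2 as target) is an assumption about the slide's layout.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Relation(Enum):
    """Relation types from the slide, keyed by the symbols used there."""
    EQUIVALENCE = "="
    BROADER = "<"
    NARROWER = ">"
    RELATED = "^"
    NULL = "0"


@dataclass(frozen=True)
class Mapping:
    """One directed relation from a term in a source KOS to a term in a target KOS."""
    source_kos: str
    source_term: str
    relation: Relation
    target_kos: str
    target_term: Optional[str]  # None for a null mapping


def targets(mappings, source_kos, term, relation=None):
    """Return the target terms mapped from `term`, optionally filtered by relation."""
    return [
        m.target_term
        for m in mappings
        if m.source_kos == source_kos
        and m.source_term == term
        and m.target_term is not None
        and (relation is None or m.relation is relation)
    ]


# Example pairs from the slide (direction assumed: KOS 1 -> KOS 2).
MAPPINGS = [
    Mapping("KOS 1", "Library", Relation.EQUIVALENCE, "KOS 2", "Bibliothèque"),
    Mapping("KOS 1", "Thesaurus", Relation.BROADER, "KOS 2", "KOS"),
    Mapping("KOS 1", "Library", Relation.NARROWER, "KOS 2", "Special library"),
    Mapping("KOS 1", "Hacker", Relation.RELATED, "KOS 2", "Computers + Security"),
    Mapping("KOS 1", "Virus", Relation.NULL, "KOS 2", None),
]

print(targets(MAPPINGS, "KOS 1", "Library", Relation.EQUIVALENCE))
```

Storing the relation type with each pair makes it possible to restrict a lookup to equivalence relations only, or to widen it to other relation types later.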
9
Cross-concordances
• 25 vocabularies in 60 cross-concordances
– Thesauri (16)
– Descriptor lists (4)
– Classifications (3)
– Subject heading lists (2)
• 380,000 mapped terms
• 465,000 relations
• 205,000 equivalence relations
• 13 German, 8 English, 1 Russian, 3 multilingual
10
Disciplines
• Social Sciences (10)
• Universal (3)
• Political science (3)
• Economics (2)
• Sports science (2)
• Gerontology (1)
• Psychology (1)
• Pedagogics (1)
• Medicine (1)
• Agricultural science (1)
• Information science (1)
11
Differences
• Vocabulary type:
– Thesaurus – Thesaurus
– Classification – Thesaurus
– (Classification – Classification)
– (Thesaurus – Descriptor list)
• Change of discipline
• Change of language
• Size
• Combination / compounds
12
Thesaurus – Thesaurus
• Equivalence (Synonym): 24%
• Identical equivalence: 21%
• Broader Term: 20%
• Related Term: 14%
• Null Mapping: 12%
• Narrower Term: 9%
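A distribution like the one on this slide is the share of each relation type among all relations of a cross-concordance. A minimal sketch of that calculation follows; the function name `relation_shares` is hypothetical and the sample list is made-up toy data, not project data.

```python
from collections import Counter


def relation_shares(relations):
    """Return the percentage share of each relation type in a list of relations."""
    counts = Counter(relations)
    total = sum(counts.values())
    return {rel: 100.0 * n / total for rel, n in counts.items()}


# Toy input: one relation symbol per mapping (illustrative only).
sample = ["=", "=", "<", ">", "^", "0", "=", "<"]
shares = relation_shares(sample)
for rel, pct in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{rel}: {pct:.1f}%")
```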
13
Classification – Thesaurus (JEL – STW)
• Narrower Term: 75%
• Broader Term: 11%
• Related Term: 8%
• Equivalence: 6%
14
Information Retrieval Tests
GOAL: Facilitate search across different databases
Navigate without semantic borders!
• Translate search terms into other terminologies
• Increase diversity of documents
• Improve search experience without effort for searcher
15
Information Retrieval Tests
1. Do mappings improve subject search?
CT (start vocabulary) → TT → Destination database
Family relations → Family AND social relations
2. Do mappings improve free-text search?
FT (start vocabulary) → FT + TT → Destination database
Family relations → Family relations OR (Family AND social relations)
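The two test setups can be sketched as query rewriting. This is an illustration under assumptions, not the project's implementation: the function names and the dictionary `EQUIVALENCES` are hypothetical, and CT and TT are read here as the controlled term and its translated term. The single entry reproduces the example on the slide.

```python
def translate_controlled(term, equivalences):
    """Test 1: replace a controlled term by its mapped expression, or None if unmapped."""
    return equivalences.get(term)


def expand_free_text(term, equivalences):
    """Test 2: keep the free-text term and OR in the mapped expression, if any."""
    translated = equivalences.get(term)
    if translated is None:
        return term
    return f"{term} OR ({translated})"


# Hypothetical equivalence table holding the example from the slide.
EQUIVALENCES = {"Family relations": "Family AND social relations"}

print(translate_controlled("Family relations", EQUIVALENCES))
print(expand_free_text("Family relations", EQUIVALENCES))
```

In the second setup the original term is kept, so an unmapped query is passed through unchanged rather than lost.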
16
Information Retrieval Tests
• Thesaurus mapping only
• Only equivalence relations
• Real queries (~6 per tested cross-concordance)
• Databases: 80,000 – 16 million documents
• Test 1 (CT → TT): 13 cross-concordances
• Test 2 (FT → FT+TT): 8 cross-concordances
17
Information Retrieval Tests - Results
• CT → TT (improvements in %)
                     Precision (= accuracy)   Recall (= hit rate)
Interdisciplinary    +68%                     +136%
Intradisciplinary    +34%                     +39%
• FT → FT+TT (improvements in %)
                     Precision (= accuracy)   Recall (= hit rate)
Interdisciplinary    -24%                     +24%
Intradisciplinary    -12%                     +20%
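For reference, precision and recall can be computed from the retrieved and relevant document sets, and an improvement can be expressed as the relative change against the search without mappings. The slides do not state how the percentages were derived, so the relative-change formula below is an assumption, and the document IDs are toy data.

```python
def precision(retrieved, relevant):
    """Share of retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Share of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


def improvement_pct(baseline, with_mapping):
    """Relative change in percent (assumed formula); baseline must be non-zero."""
    return 100.0 * (with_mapping - baseline) / baseline


# Toy example: document IDs only, not results from the tests.
relevant = {1, 2, 3, 4}
baseline_hits = {1, 5}
mapped_hits = {1, 2, 3, 5, 6, 7}

recall_gain = improvement_pct(recall(baseline_hits, relevant), recall(mapped_hits, relevant))
precision_gain = improvement_pct(precision(baseline_hits, relevant), precision(mapped_hits, relevant))
print(f"recall: {recall_gain:+.0f}%, precision: {precision_gain:+.0f}%")
```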
18
Sowiport Portal: www.sowiport.de
19
Sowiport Thesaurus
20
Sowiport Search
21
Conclusion
• Cross-concordances improve subject search with controlled terms & free-text search
• Only 24% of relations utilized (equivalence)
• Potential:
– Other relations
– Natural language query terms → CT translation
• Further mappings exist that have not yet been evaluated
• Sowiport: http://www.sowiport.de
22
Publications
Mayr, Philipp; Petras, Vivien (2008): Cross-concordances: terminology mapping and its effectiveness for information retrieval. In: 74th IFLA World Library and Information Congress. Québec, Canada. http://www.ifla.org/IV/ifla74/papers/129-Mayr_Petras-en.pdf
Mayr, Philipp; Mutschke, Peter; Petras, Vivien (2008): Reducing semantic complexity in distributed Digital Libraries: treatment of term vagueness and document re-ranking. In: Library Review. 57 (2008) 3. pp. 213-224. http://arxiv.org/abs/0712.2449
Mayr, Philipp; Petras, Vivien (2008 to appear): Building a terminology network for search: the KoMoHe project. In: International Conference on Dublin Core and Metadata Applications.
23
"Databases without semantic borders"
KoMoHe Project
http://www.gesis.org/en/research/information_technology/komohe.htm
E-mail: [email protected] .petras@gesis.org