
Post on 26-Sep-2020


Thesauri Quality Assessment: Analyzing the Rijksmuseum Library Thesaurus

By: Daan de Ruijter
Supervised by: Jacco van Ossenbruggen & Chris Dijkshoorn

Agenda

▪ Background information and context
▪ Research layout
▪ Results
▪ Conclusion
▪ Questions and Discussion

What is a Thesaurus?

A thesaurus is a:
➔ Structured vocabulary
➔ Describing Concepts
➔ According to a predefined format

A characterizing feature of thesauri is the hierarchy between different Concepts.
➔ A hierarchy helps searching, navigating and maintaining the vocabulary

An Example of how a Thesaurus Describes a Concept:

Archeologie

What falls under this concept?
➔ All literature about “archeologie”
➔ This is a design choice

What kind of information is needed for a thesaurus to describe this concept?
➔ A unique identifier (ID)
➔ A preferred label (in one or multiple languages)
➔ Alternative labels (e.g. synonyms or verb forms)
➔ Hierarchical relations with other concepts
   ◆ Broader, Narrower or Related
➔ Time-related data (creation date, last modification)
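As a rough illustration of the fields listed above, the following Python sketch models one concept as a small data structure. The class and attribute names are made up for this example and are not part of the thesaurus or of SKOS; the values are taken from the “archeologie” record shown later in the slides.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class Concept:
    """One thesaurus concept (illustrative field names, not a standard)."""
    concept_id: str                                                  # unique identifier
    pref_labels: Dict[str, str] = field(default_factory=dict)        # language -> preferred label
    alt_labels: Dict[str, List[str]] = field(default_factory=dict)   # language -> alternative labels
    broader: List[str] = field(default_factory=list)                 # IDs of broader concepts
    narrower: List[str] = field(default_factory=list)                # IDs of narrower concepts
    related: List[str] = field(default_factory=list)                 # IDs of related concepts
    created: Optional[date] = None                                   # creation date
    modified: Optional[date] = None                                  # last modification


# Values copied from the "archeologie" example in the slides.
archeologie = Concept(
    concept_id="126536",
    pref_labels={"nl": "archeologie", "en": "archaeology"},
    narrower=["126561", "126580"],
    created=date(2009, 12, 31),
    modified=date(2014, 11, 21),
)
```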


Context

The Rijksmuseum Library thesaurus is currently maintained in the MARC format
➔ This is a somewhat dated format
➔ The thesaurus relies on manual data entry
➔ Entries lack quality assurance

Main research focus: How can we assess the quality of such a thesaurus?

Research Questions

1. What changes are caused when converting from MARC to SKOS?

2. What are the quality issues of the thesaurus expressed in SKOS?

3. How well can the Rijksmuseum collection and library thesauri be aligned with each other?


Research Layout

1. Convert MARC to SKOS
SKOS supports tools and standards for quality analysis.
Done by directly mapping XML tags with an XSL Transformation (XSLT).

2. Analyze SKOS Quality Issues
With formalized methods defined by previous studies.
Done with standard SKOS tools and custom Python scripts.

3. Align Library and Collection Thesauri
To see how well the two can be integrated.
Mainly done through string matching.


Data Structure: MARC

<record>
  <leader>00485nz a2200169o 4500</leader>
  <controlfield tag="001">126536</controlfield>
  <controlfield tag="003">NL-AmRIJ</controlfield>
  <controlfield tag="005">20141121114503.0</controlfield>
  <controlfield tag="008">091231 ||az||||||||||||||||||||||||||| d</controlfield>
  <datafield tag="040" ind1=" " ind2=" ">
    <subfield code="a">NL-AmRIJ</subfield>
    <subfield code="b">dut</subfield>
    <subfield code="c">NL-AmRIJ</subfield>
    <subfield code="e">fobidrtb</subfield>
  </datafield>
  <datafield tag="150" ind1=" " ind2=" ">
    <subfield code="a">archeologie</subfield>
  </datafield>
  <datafield tag="550" ind1=" " ind2=" ">
    <subfield code="w">h</subfield>
    <subfield code="a">klassieke oudheid</subfield>
    <subfield code="0">(NLAmRIJ)126561</subfield>
  </datafield>
  <datafield tag="550" ind1=" " ind2=" ">
    <subfield code="w">h</subfield>
    <subfield code="a">archeologische sites</subfield>
  </datafield>
  <datafield tag="550" ind1=" " ind2=" ">
    <subfield code="w">h</subfield>
    <subfield code="a">onderwaterarcheologie</subfield>
  </datafield>
  <datafield tag="550" ind1=" " ind2=" ">
    <subfield code="w">h</subfield>
    <subfield code="a">pre- en protohistorie</subfield>
    <subfield code="0">(NLAmRIJ)126580</subfield>
  </datafield>
  <datafield tag="680" ind1=" " ind2=" ">
    <subfield code="i">Vertaling: archaeology</subfield>
  </datafield>
  <datafield tag="942" ind1=" " ind2=" ">
    <subfield code="a">TOPIC_TERM</subfield>
  </datafield>
</record>

Data Structure: MARC

<record>
  <leader>00485nz a2200169o 4500</leader>
  <controlfield tag="001">126536</controlfield>
  <controlfield tag="003">NL-AmRIJ</controlfield>
  <controlfield tag="005">20141121114503.0</controlfield>
  <controlfield tag="008">091231 ||az||||||||||||||||||||||||||| d</controlfield>
  <datafield tag="150" ind1=" " ind2=" ">
    <subfield code="a">archeologie</subfield>
  </datafield>
  <datafield tag="680" ind1=" " ind2=" ">
    <subfield code="i">Vertaling: archaeology</subfield>
  </datafield>
  <datafield tag="550" ind1=" " ind2=" ">
    <subfield code="w">h</subfield>
    <subfield code="a">klassieke oudheid</subfield>
    <subfield code="0">(NLAmRIJ)126561</subfield>
  </datafield>
</record>

➔ controlfield 001: Record ID
➔ datafield 150: Record “nl” label
➔ datafield 680: Record “en” label
➔ datafield 550: Hierarchical relation

Data Structure: SKOS

<skos:Concept rdf:about="http://hdl.handle.net/10934/RM0001.LIBTHESAU.126536">
  <skos:prefLabel xml:lang="nl">archeologie</skos:prefLabel>
  <skos:prefLabel xml:lang="en">archaeology</skos:prefLabel>
  <skos:narrower rdf:resource="http://hdl.handle.net/10934/RM0001.LIBTHESAU.126561"/>
  <skos:narrower rdf:resource="http://hdl.handle.net/10934/RM0001.LIBTHESAU.126580"/>
  <skos:inScheme rdf:resource="http://hdl.handle.net/10934/RM0001.SCHEME.TOPIC_TERM"/>
  <skos:changeNote>
    <rdf:Description>
      <dct:modified>2014-11-21</dct:modified>
    </rdf:Description>
  </skos:changeNote>
  <dct:created>2009-12-31</dct:created>
</skos:Concept>

➔ rdf:about: Concept ID
➔ skos:prefLabel (nl): Concept “nl” label
➔ skos:prefLabel (en): Concept “en” label
➔ skos:narrower: Hierarchical relation
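The tag-to-tag mapping between the two data structures can be sketched in a few lines. The presentation states that the real conversion was done with an XSL Transformation; the Python below is only a stand-in to make the mapping concrete. It assumes a record without XML namespaces (as printed on the slides), that the English label is stored in field 680 behind the prefix “Vertaling: ”, and that a 550 relation can only be linked when subfield “0” ends in a numeric ID. The function and variable names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# URI prefix as it appears in the SKOS example on the slides.
BASE = "http://hdl.handle.net/10934/RM0001.LIBTHESAU."

# Abridged copy of the "archeologie" record from the slides.
MARC = """<record>
  <controlfield tag="001">126536</controlfield>
  <datafield tag="150" ind1=" " ind2=" ">
    <subfield code="a">archeologie</subfield>
  </datafield>
  <datafield tag="680" ind1=" " ind2=" ">
    <subfield code="i">Vertaling: archaeology</subfield>
  </datafield>
  <datafield tag="550" ind1=" " ind2=" ">
    <subfield code="w">h</subfield>
    <subfield code="a">klassieke oudheid</subfield>
    <subfield code="0">(NLAmRIJ)126561</subfield>
  </datafield>
  <datafield tag="550" ind1=" " ind2=" ">
    <subfield code="w">h</subfield>
    <subfield code="a">archeologische sites</subfield>
  </datafield>
</record>"""


def subfield(datafield, code):
    """Return the text of the first subfield with this code, or None."""
    node = datafield.find(f"subfield[@code='{code}']")
    return node.text if node is not None else None


def marc_to_concept(xml_text):
    """Map one MARC record to a dict holding SKOS-like properties."""
    record = ET.fromstring(xml_text)
    concept = {
        "uri": BASE + record.find("controlfield[@tag='001']").text,
        "prefLabel": {},
        "narrower": [],
    }
    for df in record.findall("datafield"):
        tag = df.get("tag")
        if tag == "150":
            concept["prefLabel"]["nl"] = subfield(df, "a")
        elif tag == "680":
            note = subfield(df, "i") or ""
            if note.startswith("Vertaling: "):
                concept["prefLabel"]["en"] = note[len("Vertaling: "):]
        elif tag == "550" and subfield(df, "w") == "h":
            # Assumption: a relation is only convertible when subfield "0"
            # carries the numeric ID of the target concept.
            match = re.search(r"\)(\d+)$", subfield(df, "0") or "")
            if match:
                concept["narrower"].append(BASE + match.group(1))
    return concept


concept = marc_to_concept(MARC)
```

In this sketch the relation to “archeologische sites” is silently skipped because it has no subfield “0”, so only one of the two 550 fields ends up as a narrower link.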

Converting the Thesaurus: MARC 550 Tag Errors. What are they?

In our concept “archeologie”:
➔ Number of hierarchical relations in MARC: 4
➔ Number of hierarchical relations in SKOS: 2
What happened?

<datafield tag="550" ind1=" " ind2=" ">
  <subfield code="w">h</subfield>
  <subfield code="a">klassieke oudheid</subfield>
  <subfield code="0">(NLAmRIJ)126561</subfield>
</datafield>
<datafield tag="550" ind1=" " ind2=" ">
  <subfield code="w">h</subfield>
  <subfield code="a">archeologische sites</subfield>
</datafield>
<datafield tag="550" ind1=" " ind2=" ">
  <subfield code="w">h</subfield>
  <subfield code="a">onderwaterarcheologie</subfield>
</datafield>
<datafield tag="550" ind1=" " ind2=" ">
  <subfield code="w">h</subfield>
  <subfield code="a">pre- en protohistorie</subfield>
  <subfield code="0">(NLAmRIJ)126580</subfield>
</datafield>

➔ Entries 1 and 4: code “0” holds the Concept ID
➔ Entries 2 and 3 (archeologische sites, onderwaterarcheologie): No code “0”

MARC 550 Tag Errors (N = 14828)

                        code “w”   code “a”         code “0”
Error count             9          37               875
Correct entry example   h          boekwetenschap   (NL-AmRIJ)126543

Entry error examples:

w NULL NULL
9 mariaverering
(NL-AmRIJ)131820 (NL-AmRIJ)#129341
hippodromen (NL-AmRIJ)

Each error represents a hierarchical relation that cannot be converted to SKOS.
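A check along these lines could flag the three kinds of 550 errors. This is a minimal sketch, not the script used in the study. It assumes that code “w” should be `g` (broader) or `h` (narrower), as in MARC 21 authority tracings, that a code “a” value is wrong when it is empty or the literal `NULL`, and that code “0” should be an organisation prefix followed only by digits. The function name and the exact pattern are hypothetical.

```python
import re

# Assumption: MARC 21 authority tracing codes, g = broader term, h = narrower term.
VALID_W = {"g", "h"}

# Assumption: "(NL-AmRIJ)" or "(NLAmRIJ)" followed by digits only.
ID_PATTERN = re.compile(r"^\(NL-?AmRIJ\)\d+$")


def check_550(w, a, zero):
    """Return the set of subfield codes ("w", "a", "0") that look invalid."""
    errors = set()
    if w not in VALID_W:
        errors.add("w")
    if not a or a == "NULL":
        errors.add("a")
    if not zero or not ID_PATTERN.match(zero):
        errors.add("0")
    return errors


# The correct entry from the table passes all three checks.
ok = check_550("h", "boekwetenschap", "(NL-AmRIJ)126543")

# Malformed or missing code "0" values, as in the error examples.
stray_hash = check_550("h", "boekwetenschap", "(NL-AmRIJ)#129341")
no_number = check_550("h", "boekwetenschap", "(NL-AmRIJ)")
missing = check_550("h", "archeologische sites", None)
```

Any entry for which `check_550` returns `"0"` has no usable target ID, which is why it cannot become a `skos:narrower` or `skos:broader` link.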


Language Coverage (N = 7826)

The low number of English terms and alternative labels could be seen as a quality issue
➔ This depends on the intended use of the thesaurus

            nl     en
prefLabel   7826   60
altLabel    1149   0
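Counts like the ones in this table can be produced with a short script. The sketch below uses three toy concepts with hypothetical IDs (`c1` to `c3`), not the real thesaurus data, and the function name is made up. It counts preferred labels per language and the number of concepts that lack a label in at least one of the selected languages.

```python
from collections import Counter


def language_coverage(concepts, languages=("nl", "en")):
    """Count prefLabels per language and concepts missing any of the languages."""
    counts = Counter()
    incomplete = 0
    for labels in concepts.values():
        for lang in languages:
            if labels.get(lang):
                counts[lang] += 1
        if any(not labels.get(lang) for lang in languages):
            incomplete += 1
    return counts, incomplete


# Toy data: concept ID -> {language: preferred label}.
toy = {
    "c1": {"nl": "archeologie", "en": "archaeology"},
    "c2": {"nl": "kunstenaars"},
    "c3": {"nl": "boekwetenschap"},
}

counts, incomplete = language_coverage(toy)
```

Applied to the figures above, and assuming every concept with an English label also has a Dutch one, this gives 7826 - 60 = 7766 concepts with incomplete language coverage.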

Quality Analysis Results (for all: N = 7826)

Quality Issue                      Count in MARC   Count in SKOS   After Skosify
Omitted or Invalid Language Tags   0               0               0
Incomplete Language Coverage       7766            7766            7766
No Common Language                 0               0               0
Overlapping Labels                 29              29              29
Empty Labels                       0               0               0
Orphan Concepts                    976             1391            1364
Cyclic Hierarchical Relations      unknown         23              0
Valueless Associative Relations    unknown         183             0
Omitted Top Concepts               unknown         2043            0

Orphan Concepts: concepts without a hierarchical relation
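Two of the structural checks in this table, orphan concepts and cyclic hierarchical relations, can be illustrated on a plain dictionary of narrower links. This is a simplified sketch and not the implementation used by the SKOS tools mentioned in the slides. It follows the slide's definition of an orphan as a concept without a hierarchical relation; the function names and toy IDs are hypothetical.

```python
def find_orphans(concept_ids, narrower):
    """Return concepts that take part in no broader/narrower link."""
    linked = set()
    for parent, children in narrower.items():
        if children:
            linked.add(parent)
            linked.update(children)
    return set(concept_ids) - linked


def has_cycle(narrower):
    """Depth-first search for a cycle in the narrower graph."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {}

    def visit(node):
        state[node] = GREY
        for child in narrower.get(node, ()):
            seen = state.get(child, WHITE)
            if seen == GREY:          # back edge: child is already on the path
                return True
            if seen == WHITE and visit(child):
                return True
        state[node] = BLACK
        return False

    return any(state.get(n, WHITE) == WHITE and visit(n) for n in list(narrower))


# Toy hierarchy: a -> b -> c, with d unconnected.
ids = ["a", "b", "c", "d"]
tree = {"a": ["b"], "b": ["c"]}
orphans = find_orphans(ids, tree)
tree_is_cyclic = has_cycle(tree)

# Toy hierarchy with a loop: a -> b -> a.
loop_is_cyclic = has_cycle({"a": ["b"], "b": ["a"]})
```

The recursive search is fine for a small example; a thesaurus with very deep hierarchies would need an iterative version.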


What is an alignment?

An alignment is a:
➔ Pair of concepts from two different thesauri that are found to be identical

Alignments are most commonly found by:
➔ Exactly matching concept labels
➔ Matching modified labels (stemming, lemmatization)
➔ Comparing concept structures

An Alignment Example

Concept in the library thesaurus (this concept is seen as a topic term):

<skos:Concept rdf:about="http://hdl.handle.net/10934/RM0001.LIBTHESAU.129814">
  <skos:prefLabel xml:lang="nl">kunstenaars</skos:prefLabel>
  <skos:inScheme rdf:resource="http://hdl.handle.net/10934/RM0001.SCHEME.TOPIC_TERM"/>
</skos:Concept>

Concept in the collection thesaurus (this concept is seen as an occupation):

<skos:Concept rdf:about="http://hdl.handle.net/10934/RM0001.THESAU.38160">
  <skos:prefLabel xml:lang="nl">kunstenaar</skos:prefLabel>
  <skos:inScheme rdf:resource="http://hdl.handle.net/10934/RM0001.SCHEME.OCCUPATION"/>
</skos:Concept>
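The two labels in this example differ only in the plural “s”, so an exact comparison misses the pair while a stemmed comparison finds it. The sketch below shows both variants. The `naive_stem` function is a deliberately crude, hypothetical stand-in that strips a trailing “en” or “s”; it is not the stemmer used in the study, and a real Dutch stemmer would behave differently on many words.

```python
def naive_stem(label):
    """Hypothetical stand-in for a real stemmer: lowercase, strip a plural suffix."""
    label = label.strip().lower()
    for suffix in ("en", "s"):
        if label.endswith(suffix) and len(label) > len(suffix) + 2:
            return label[: -len(suffix)]
    return label


def align(library, collection, normalize=lambda s: s):
    """Return (library_id, collection_id) pairs whose labels match after normalize()."""
    index = {}
    for cid, label in collection.items():
        index.setdefault(normalize(label), []).append(cid)
    return [
        (lid, cid)
        for lid, label in library.items()
        for cid in index.get(normalize(label), [])
    ]


# Labels and ID suffixes taken from the two concepts above.
library = {"LIBTHESAU.129814": "kunstenaars"}
collection = {"THESAU.38160": "kunstenaar"}

exact = align(library, collection)                        # exact string matching
stemmed = align(library, collection, normalize=naive_stem)  # matching after stemming
```

Building an index over one thesaurus first keeps the comparison close to linear in the number of labels instead of comparing every pair.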

Library Thesaurus Alignment onto the Collection Thesaurus Using Exact String Matching (N = 7826)

Selected Label Type             Selected Languages   Aligned Concepts (before stemming)   Aligned Concepts after Stemming   Percentage of Aligned Concepts after Stemming
skos:prefLabel, skos:altLabel   nl, en               844                                  1030                              13.16%
skos:prefLabel, skos:altLabel   nl                   840                                  1024                              13.08%
skos:prefLabel, skos:altLabel   en                   3                                    4                                 0.05%
skos:prefLabel                  nl, en               729                                  894                               11.42%
skos:prefLabel                  nl                   726                                  890                               11.37%
skos:prefLabel                  en                   3                                    4                                 0.05%
skos:altLabel                   nl, en               13                                   16                                0.20%
skos:altLabel                   nl                   13                                   16                                0.20%
skos:altLabel                   en                   0                                    0                                 0.00%

Conclusion

▪ The mapping from MARC to SKOS tags proved to be a viable conversion method.

▪ SKOS both provided better insight into quality issues and was supported by tools to fix them.

▪ The number of alignments between the Rijksmuseum thesauri was low.

What’s in it for the Rijksmuseum?


▪ Converting the thesaurus from MARC to SKOS would allow for better maintainability and interoperability

▪ Improving the thesaurus quality in terms of documentation and structure allows for more alignments to be made with other thesauri

THANK YOU FOR YOUR ATTENTION

Are there any questions?

Follow this project on GitHub:
Special thanks to:
Jacco van Ossenbruggen (VU - supervision)
Chris Dijkshoorn (Rijksmuseum - supervision)
Contact me: d.a.c.de.ruijter@student.vu.nl