Thesauri Quality Assessment: Analyzing the Rijksmuseum Library Thesaurus

By: Daan de Ruijter
Supervised by: Jacco van Ossenbruggen & Chris Dijkshoorn


Agenda

▪ Background information and context
▪ Research layout
▪ Results
▪ Conclusion
▪ Questions and Discussion


What is a Thesaurus?

A thesaurus is a:
➔ Structured vocabulary
➔ Describing Concepts
➔ According to a predefined format

A characterizing feature of thesauri is the hierarchy between different Concepts.
➔ A hierarchy helps searching, navigating and maintaining the vocabulary


An Example of how a Thesaurus Describes a Concept:

Archeologie

What falls under this concept?
➔ All literature about “archeologie”
➔ This is a design choice

What kind of information is needed for a thesaurus to describe this concept?
➔ A unique identifier (ID)
➔ A preferred label (in one or multiple languages)
➔ Alternative labels (e.g. synonyms or verb forms)
➔ Hierarchical relations with other concepts
    ◆ Broader, Narrower or Related
➔ Time-related data (creation date, last modification)
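
As a minimal sketch, the fields listed above could be modelled like this in Python. The class name `Concept` and all field names are illustrative assumptions, not part of the thesaurus or of any standard. The values are taken from the “archeologie” record shown in the data structure slides.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class Concept:
    """Illustrative container for the information a thesaurus keeps per concept."""
    concept_id: str                                    # unique identifier
    pref_labels: Dict[str, str]                        # language code -> preferred label
    alt_labels: Dict[str, List[str]] = field(default_factory=dict)  # language code -> synonyms
    broader: List[str] = field(default_factory=list)   # IDs of broader concepts
    narrower: List[str] = field(default_factory=list)  # IDs of narrower concepts
    related: List[str] = field(default_factory=list)   # IDs of related concepts
    created: Optional[date] = None
    modified: Optional[date] = None


archeologie = Concept(
    concept_id="126536",
    pref_labels={"nl": "archeologie", "en": "archaeology"},
    narrower=["126561", "126580"],
    created=date(2009, 12, 31),
    modified=date(2014, 11, 21),
)
print(archeologie.pref_labels["en"], archeologie.narrower)
```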


Context

The Rijksmuseum Library thesaurus is currently maintained in the MARC format
➔ This is a somewhat dated format
➔ Data is entered manually
➔ Entries lack quality assurance

Main research focus:
How can we assess the quality of such a thesaurus?


Research Questions

1. What changes are caused when converting from MARC to SKOS?

2. What are the quality issues of the thesaurus expressed in SKOS?

3. How well can the Rijksmuseum collection and library thesauri be aligned with each other?


Research Layout

1. Convert MARC to SKOS
SKOS supports tools and standards for quality analysis.
Done by directly mapping XML tags with an XSL Transformation (XSLT).

2. Analyze SKOS Quality Issues
With formalized methods defined by previous studies.
Done with standard SKOS tools and custom Python scripts.

3. Align Library and Collection Thesauri
To see how well the two can be integrated.
Mainly done through string matching.


Data Structure: MARC

<record>
  <leader>00485nz a2200169o 4500</leader>
  <controlfield tag="001">126536</controlfield>
  <controlfield tag="003">NL-AmRIJ</controlfield>
  <controlfield tag="005">20141121114503.0</controlfield>
  <controlfield tag="008">091231 ||az||||||||||||||||||||||||||| d</controlfield>
  <datafield tag="040" ind1=" " ind2=" ">
    <subfield code="a">NL-AmRIJ</subfield>
    <subfield code="b">dut</subfield>
    <subfield code="c">NL-AmRIJ</subfield>
    <subfield code="e">fobidrtb</subfield>
  </datafield>
  <datafield tag="150" ind1=" " ind2=" ">
    <subfield code="a">archeologie</subfield>
  </datafield>
  <datafield tag="550" ind1=" " ind2=" ">
    <subfield code="w">h</subfield>
    <subfield code="a">klassieke oudheid</subfield>
    <subfield code="0">(NLAmRIJ)126561</subfield>
  </datafield>
  <datafield tag="550" ind1=" " ind2=" ">
    <subfield code="w">h</subfield>
    <subfield code="a">archeologische sites</subfield>
  </datafield>
  <datafield tag="550" ind1=" " ind2=" ">
    <subfield code="w">h</subfield>
    <subfield code="a">onderwaterarcheologie</subfield>
  </datafield>
  <datafield tag="550" ind1=" " ind2=" ">
    <subfield code="w">h</subfield>
    <subfield code="a">pre- en protohistorie</subfield>
    <subfield code="0">(NLAmRIJ)126580</subfield>
  </datafield>
  <datafield tag="680" ind1=" " ind2=" ">
    <subfield code="i">Vertaling: archaeology</subfield>
  </datafield>
  <datafield tag="942" ind1=" " ind2=" ">
    <subfield code="a">TOPIC_TERM</subfield>
  </datafield>
</record>


Data Structure: MARC

<record>
  <leader>00485nz a2200169o 4500</leader>
  <controlfield tag="001">126536</controlfield>  <!-- Record ID -->
  <controlfield tag="003">NL-AmRIJ</controlfield>
  <controlfield tag="005">20141121114503.0</controlfield>
  <controlfield tag="008">091231 ||az||||||||||||||||||||||||||| d</controlfield>
  <datafield tag="150" ind1=" " ind2=" ">
    <subfield code="a">archeologie</subfield>  <!-- Record “nl” label -->
  </datafield>
  <datafield tag="680" ind1=" " ind2=" ">
    <subfield code="i">Vertaling: archaeology</subfield>  <!-- Record “en” label -->
  </datafield>
  <datafield tag="550" ind1=" " ind2=" ">  <!-- Hierarchical relation -->
    <subfield code="w">h</subfield>
    <subfield code="a">klassieke oudheid</subfield>
    <subfield code="0">(NLAmRIJ)126561</subfield>
  </datafield>
</record>


Data Structure: SKOS

<!-- Concept ID: the rdf:about URI -->
<skos:Concept rdf:about="http://hdl.handle.net/10934/RM0001.LIBTHESAU.126536">
  <skos:prefLabel xml:lang="nl">archeologie</skos:prefLabel>  <!-- Concept “nl” label -->
  <skos:prefLabel xml:lang="en">archaeology</skos:prefLabel>  <!-- Concept “en” label -->
  <!-- Hierarchical relation -->
  <skos:narrower rdf:resource="http://hdl.handle.net/10934/RM0001.LIBTHESAU.126561"/>
  <skos:narrower rdf:resource="http://hdl.handle.net/10934/RM0001.LIBTHESAU.126580"/>
  <skos:inScheme rdf:resource="http://hdl.handle.net/10934/RM0001.SCHEME.TOPIC_TERM"/>
  <skos:changeNote>
    <rdf:Description>
      <dct:modified>2014-11-21</dct:modified>
    </rdf:Description>
  </skos:changeNote>
  <dct:created>2009-12-31</dct:created>
</skos:Concept>
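
The conversion in this project was done with an XSL Transformation. Purely to illustrate the same tag-to-property mapping, here is a minimal Python sketch using only the standard library. The function names `subfield` and `marc_to_skos`, the returned dictionary layout, and the rule that the English label follows the prefix "Vertaling: " in field 680 are assumptions inferred from the single example record, not the project's actual rules. The sketch also assumes MARC XML without a namespace, as on the slides, and leaves out the date fields.

```python
import re
import xml.etree.ElementTree as ET

# URI prefixes copied from the SKOS example; everything else is illustrative.
BASE = "http://hdl.handle.net/10934/RM0001.LIBTHESAU."
SCHEME = "http://hdl.handle.net/10934/RM0001.SCHEME."
EN_PREFIX = "Vertaling: "  # assumed marker for the English label in field 680


def subfield(datafield, code):
    """Return the text of the first subfield with the given code, or None."""
    el = datafield.find(f"subfield[@code='{code}']")
    return el.text if el is not None else None


def marc_to_skos(record_xml):
    """Map one MARC authority record to a dict of SKOS-style properties."""
    rec = ET.fromstring(record_xml)
    concept = {"uri": None, "prefLabel": {}, "narrower": [], "inScheme": None}
    for cf in rec.findall("controlfield"):
        if cf.get("tag") == "001":
            concept["uri"] = BASE + cf.text
    for df in rec.findall("datafield"):
        tag = df.get("tag")
        if tag == "150":
            concept["prefLabel"]["nl"] = subfield(df, "a")
        elif tag == "680":
            note = subfield(df, "i") or ""
            if note.startswith(EN_PREFIX):
                concept["prefLabel"]["en"] = note[len(EN_PREFIX):]
        elif tag == "550" and subfield(df, "w") == "h":
            # Without a usable code "0" there is no target URI, so the relation is dropped.
            match = re.fullmatch(r"\(NL-?AmRIJ\)(\d+)", subfield(df, "0") or "")
            if match:
                concept["narrower"].append(BASE + match.group(1))
        elif tag == "942":
            scheme = subfield(df, "a")
            if scheme:
                concept["inScheme"] = SCHEME + scheme
    return concept


SAMPLE = """<record>
  <controlfield tag="001">126536</controlfield>
  <datafield tag="150" ind1=" " ind2=" "><subfield code="a">archeologie</subfield></datafield>
  <datafield tag="550" ind1=" " ind2=" ">
    <subfield code="w">h</subfield>
    <subfield code="a">klassieke oudheid</subfield>
    <subfield code="0">(NLAmRIJ)126561</subfield>
  </datafield>
  <datafield tag="550" ind1=" " ind2=" ">
    <subfield code="w">h</subfield>
    <subfield code="a">archeologische sites</subfield>
  </datafield>
  <datafield tag="680" ind1=" " ind2=" "><subfield code="i">Vertaling: archaeology</subfield></datafield>
  <datafield tag="942" ind1=" " ind2=" "><subfield code="a">TOPIC_TERM</subfield></datafield>
</record>"""

concept = marc_to_skos(SAMPLE)
print(concept["uri"])
print(concept["prefLabel"])
print(concept["narrower"])  # only the 550 entry that carries a code "0" survives
```

In this sketch a 550 entry without a code "0" cannot be turned into a URI and is silently skipped. A real conversion would more likely log such entries so they can be repaired in the source data.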


Converting the Thesaurus: MARC 550 Tag Errors. What are they?

In our concept “archeologie”:
➔ Number of hierarchical relations in MARC: 4
➔ Number of hierarchical relations in SKOS: 2

What happened?


<datafield tag="550" ind1=" " ind2=" ">
  <subfield code="w">h</subfield>
  <subfield code="a">klassieke oudheid</subfield>
  <subfield code="0">(NLAmRIJ)126561</subfield>  <!-- Concept ID -->
</datafield>
<datafield tag="550" ind1=" " ind2=" ">  <!-- No code “0” -->
  <subfield code="w">h</subfield>
  <subfield code="a">archeologische sites</subfield>
</datafield>
<datafield tag="550" ind1=" " ind2=" ">  <!-- No code “0” -->
  <subfield code="w">h</subfield>
  <subfield code="a">onderwaterarcheologie</subfield>
</datafield>
<datafield tag="550" ind1=" " ind2=" ">
  <subfield code="w">h</subfield>
  <subfield code="a">pre- en protohistorie</subfield>
  <subfield code="0">(NLAmRIJ)126580</subfield>
</datafield>


MARC 550 Tag Errors (N = 14828)

                        code “w”   code “a”         code “0”
Error count             9          37               875
Correct entry example   h          boekwetenschap   (NL-AmRIJ)126543

Entry error examples:
w NULL NULL
9 mariaverering
(NL-AmRIJ)131820 (NL-AmRIJ)#129341
hippodromen (NL-AmRIJ) 14

Each error represents a hierarchical relation that cannot be converted to SKOS.
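
A check of this kind could be sketched as follows. The validity rules are assumptions inferred from the examples above, not the rules used in the study: code “w” is expected to be "h" or "g" (the MARC 21 tracing codes for narrower and broader term), code “a” must be non-empty, and code “0” must be the organisation code in parentheses followed directly by digits. The function name `check_550` is hypothetical, and the last two test entries combine values from the table for illustration only.

```python
import re

ID_PATTERN = re.compile(r"\(NL-?AmRIJ\)\d+")  # assumed shape of a valid code "0"
VALID_W = {"g", "h"}                          # assumed valid relation codes


def check_550(w, a, zero):
    """Return the subfield codes that look invalid for one 550 entry."""
    errors = []
    if w not in VALID_W:
        errors.append("w")
    if not a or not a.strip():
        errors.append("a")
    if zero is None or not ID_PATTERN.fullmatch(zero):
        errors.append("0")
    return errors


entries = [
    ("h", "boekwetenschap", "(NL-AmRIJ)126543"),  # correct entry from the table
    ("h", "archeologische sites", None),          # relation without code "0"
    ("w", None, None),                            # illustrative combination
    ("9", "mariaverering", "(NL-AmRIJ)#129341"),  # illustrative combination
]
for entry in entries:
    print(entry, "->", check_550(*entry) or "ok")
```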


Language Coverage (N = 7826)

             nl      en
prefLabel    7826    60
altLabel     1149    0

The low number of English labels and of alternative labels could be seen as a quality issue.
➔ This depends on the intended use of the thesaurus
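
Counts like these can be produced with a few lines of Python. The sketch below is a simplified illustration on toy data, not the script used in the study: the nested dictionary layout and the function names `label_counts` and `incomplete_coverage` are assumptions, and it counts concepts that have at least one label of a given type and language.

```python
from collections import Counter

# Toy data: concept ID -> {label type -> {language -> [labels]}}
concepts = {
    "126536": {"prefLabel": {"nl": ["archeologie"], "en": ["archaeology"]}},
    "129814": {"prefLabel": {"nl": ["kunstenaars"]}},
}


def label_counts(concepts):
    """Count concepts with at least one label, per (label type, language)."""
    counts = Counter()
    for labels in concepts.values():
        for label_type, by_lang in labels.items():
            for lang, values in by_lang.items():
                if values:
                    counts[(label_type, lang)] += 1
    return counts


def incomplete_coverage(concepts):
    """Return IDs of concepts missing a label in a language used elsewhere."""
    def languages(labels):
        return {lang for by_lang in labels.values()
                for lang, values in by_lang.items() if values}

    all_langs = set()
    for labels in concepts.values():
        all_langs |= languages(labels)
    return sorted(cid for cid, labels in concepts.items()
                  if languages(labels) != all_langs)


print(dict(label_counts(concepts)))
print(incomplete_coverage(concepts))
```

Applied to the figures above, the same reasoning would flag every concept without an English label, that is 7826 - 60 = 7766 concepts.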


Quality Analysis Results (for all: N = 7826)

Quality Issue                      Count in MARC   Count in SKOS   After Skosify
Omitted or Invalid Language Tags   0               0               0
Incomplete Language Coverage       7766            7766            7766
No Common Language                 0               0               0
Overlapping Labels                 29              29              29
Empty Labels                       0               0               0
Orphan Concepts                    976             1391            1364
Cyclic Hierarchical Relations      unknown         23              0
Valueless Associative Relations    unknown         183             0
Omitted Top Concepts               unknown         2043            0

Orphan Concepts: concepts without a hierarchical relation
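
Two of the structural checks can be illustrated with a small Python sketch: orphan concepts (following the definition on this slide) and cyclic hierarchical relations. This is not how Skosify or the other tools used in the study implement them. The input format (a dictionary from a concept to its narrower concepts) and the function names are assumptions for illustration.

```python
def find_orphans(concept_ids, narrower):
    """Concepts that take part in no broader/narrower relation at all."""
    linked = set()
    for parent, children in narrower.items():
        if children:
            linked.add(parent)
            linked.update(children)
    return sorted(set(concept_ids) - linked)


def find_cycle_members(narrower):
    """Concepts that can reach themselves by following narrower links."""
    def reachable(start):
        seen, stack = set(), list(narrower.get(start, []))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(narrower.get(node, []))
        return seen

    return sorted(c for c in narrower if c in reachable(c))


# Toy hierarchy: A > B > C > A forms a cycle, D and E have no relations.
concept_ids = ["A", "B", "C", "D", "E"]
narrower = {"A": ["B"], "B": ["C"], "C": ["A"], "D": []}
print(find_orphans(concept_ids, narrower))   # ['D', 'E']
print(find_cycle_members(narrower))          # ['A', 'B', 'C']
```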


What is an alignment?

An alignment is a:
➔ Pair of concepts from two different thesauri that are found to be identical

Alignments are most commonly found by:
➔ Exactly matching concept labels
➔ Matching modified labels (stemming, lemmatization)
➔ Comparing concept structures


An Alignment Example

Concept in the library thesaurus:

<skos:Concept rdf:about="http://hdl.handle.net/10934/RM0001.LIBTHESAU.129814">
  <skos:prefLabel xml:lang="nl">kunstenaars</skos:prefLabel>
  <!-- This concept is seen as a topic term -->
  <skos:inScheme rdf:resource="http://hdl.handle.net/10934/RM0001.SCHEME.TOPIC_TERM"/>
</skos:Concept>

Concept in the collection thesaurus:

<skos:Concept rdf:about="http://hdl.handle.net/10934/RM0001.THESAU.38160">
  <skos:prefLabel xml:lang="nl">kunstenaar</skos:prefLabel>
  <!-- This concept is seen as an occupation -->
  <skos:inScheme rdf:resource="http://hdl.handle.net/10934/RM0001.SCHEME.OCCUPATION"/>
</skos:Concept>
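
The difference between exact matching and matching after stemming can be sketched in Python with the two labels above. The source does not say which stemmer was used, so `toy_stem` below is only a crude stand-in that strips one plural-like suffix. A real stemmer for Dutch would behave differently. All function names are hypothetical.

```python
def normalize(label):
    """Lowercase and collapse whitespace before comparing labels."""
    return " ".join(label.lower().split())


def toy_stem(label):
    """Crude stand-in for a real stemmer: strip one plural-like suffix per word."""
    words = []
    for word in normalize(label).split():
        for suffix in ("en", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        words.append(word)
    return " ".join(words)


def align(library, collection, key=normalize):
    """Pair concept IDs from two {id: label} dicts whose keyed labels are equal."""
    index = {}
    for cid, label in collection.items():
        index.setdefault(key(label), []).append(cid)
    return [(lid, cid)
            for lid, label in library.items()
            for cid in index.get(key(label), [])]


library = {"LIBTHESAU.129814": "kunstenaars"}
collection = {"THESAU.38160": "kunstenaar"}
print(align(library, collection))                # [] (no exact match)
print(align(library, collection, key=toy_stem))  # [('LIBTHESAU.129814', 'THESAU.38160')]
```

A label match alone does not settle whether two concepts mean the same thing: here one concept is a topic term and the other an occupation, so such candidate pairs still need review.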


Library Thesaurus Alignment onto the Collection Thesaurus Using Exact String Matching (N = 7826)

Selected Label Type             Selected    Aligned Concepts    Aligned Concepts   Percentage of Aligned
                                Languages   (before stemming)   after Stemming     Concepts after Stemming
skos:prefLabel, skos:altLabel   nl, en      844                 1030               13.16%
                                nl          840                 1024               13.08%
                                en          3                   4                  0.05%
skos:prefLabel                  nl, en      729                 894                11.42%
                                nl          726                 890                11.37%
                                en          3                   4                  0.05%
skos:altLabel                   nl, en      13                  16                 0.20%
                                nl          13                  16                 0.20%
                                en          0                   0                  0.00%
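
The percentages appear to be the number of aligned concepts after stemming divided by N = 7826. A quick check in Python, with the counts copied from the table and illustrative variable names:

```python
N = 7826  # number of concepts in the library thesaurus

# (label types, languages) -> aligned concepts after stemming
after_stemming = {
    ("prefLabel+altLabel", "nl, en"): 1030,
    ("prefLabel+altLabel", "nl"): 1024,
    ("prefLabel+altLabel", "en"): 4,
    ("prefLabel", "nl, en"): 894,
    ("prefLabel", "nl"): 890,
    ("altLabel", "nl, en"): 16,
}

percentages = {key: f"{100 * count / N:.2f}%" for key, count in after_stemming.items()}
for key, pct in percentages.items():
    print(key, pct)
```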


Conclusion

▪ The mapping from MARC to SKOS tags proved to be a viable conversion method.

▪ SKOS both provided better insight into quality issues and was supported by tools to fix them.

▪ The number of alignments between the Rijksmuseum thesauri was low.


What’s in it for the Rijksmuseum?


▪ Converting the thesaurus from MARC to SKOS would allow for better maintainability and interoperability

▪ Improving the thesaurus quality in terms of documentation and structure allows for more alignments to be made with other thesauri


THANK YOU FOR YOUR ATTENTION

Are there any questions?

Follow this project on GitHub:

Special thanks to:
Jacco van Ossenbruggen (VU - supervision)
Chris Dijkshoorn (Rijksmuseum - supervision)

Contact me: [email protected]