LBSC 670 Information Organization. Today Guest Speaker –Jeremy York – HathiTrust Classification...

LBSC 670

Information Organization

Today

• Guest Speaker – Jeremy York – HathiTrust

• Classification Thoughts and CV
  – Overview & History
  – Related concepts
  – Examples

• A note on MARC specifications

Classification concepts

Aboutness, specificity, granularity

“Words have power”: classification systems exist within a socio-political context

Classification methods: manual/automatic, pre/post-coordinate, hierarchical/faceted, formal/social

CV overview

• What are controlled vocabularies?
  – Types
  – Basic concepts

• How are CVs created and maintained?
  – Metadata standards
  – Example systems

• When does a CV turn into a KO?
  – Term Lists, Thesauri, Taxonomies, Ontologies

Controlled Vocabularies

“organized lists of words and phrases, or notation systems, that are used to initially tag content, and then to find it through navigation or search.” (Warner via Leise, Fast)

“the primary purpose of vocabulary control is to achieve consistency in the description of content objects and to facilitate retrieval” (ANSI Z39.19)

Knowledge Organization

• “tools that present the organized interpretation of knowledge structures” (Hjørland)

• “classification schemes that organize materials at a general level…, subject headings that provide more detailed access, and authority files that control variant versions of key information” (Hodge)

Uses of controlled vocabulary

• Define scope, content, and context of a body of knowledge

• Support discovery - Navigation, search, browsing

• Map information objects to user terminology

• Enforce term consistency and relationships

A good CV. . .

• Removes ambiguity

• Defines relationships between things

• Contextualizes information

A+

CV Concepts

• Content Analysis
  – Ambiguity
  – Synonymy
  – Exhaustivity
  – Specificity
  – Co-extensivity
  – Aboutness
  – Semantic structure
  – Warrant (User, Literary, Organization)

• Form Analysis
  – Linguistics
  – Grammar
  – Semiotics
  – Single / Multiple terms

• Indexing & Retrieval
  – Pre vs. Post Coordinate
  – Recall vs. Precision
  – Natural language processing (NLP)

http://bit.ly/lbsc_670_cv

Content Analysis

• Ambiguity
  – Each term should relate to a single concept

• Synonymy
  – Each concept should be identified by a single entry

• Specificity
  – Using the most specific word or phrase expressing the subject

• Exhaustivity
  – The extent to which the entire document is indexed (summarization, depth)

• Co-extensivity
  – “Assign as many terms as needed to bring out the main theme, and according to guidelines sub-themes.” (p. 29, Lancaster)
  – “nothing more, nothing less”

• Semantic Structure
  – Terms can be related with equivalence, hierarchical, or associative relationships (Use, See, NT, BT, RT)

Content Analysis (2)

• Aboutness = Subject/topic?
  – Wilson (1968)
    • Author intent, topicality, relationship to other resources, textual analysis
  – Fairthorne (1969)
    • Intentional aboutness (author), extensional aboutness (document)
  – Maron (1977)
    • Objective about (document), subjective about (user), and retrieval about (information retrieval)
  – Hjørland (2001)
    • “Closely related to theories of meaning, interpretation, and epistemology”

Content Analysis (3)

• Wilson’s criteria for evaluating aboutness (1968)
  – Identify author’s purpose (intent)
  – Weigh the predominant topics, elements (topical analysis)
  – Group/count a document’s use of concepts and references (bibliometrics)
  – Identify essential elements (text analysis)

Content Analysis (4)

• Literary Warrant
  – “The inclusion of a vocabulary term in a controlled vocabulary based on its appearance in one or more content items. For example, a medical text may use the term ‘oncology.’ Based on literary warrant, that term would be included in the controlled vocabulary even though the general public uses the term ‘cancer.’” (Glosso-Thesaurus)

• User Warrant
  – “The inclusion of a vocabulary term in a controlled vocabulary based on use by users. Such terms can be identified through search log analysis or free listing.” (Glosso-Thesaurus)

• Organizational Warrant
  – “Justification for the...selection of a preferred term due to the characteristics and context of the organization using the resource” (ANSI Z39.19)
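The search log analysis mentioned under user warrant can be sketched in a few lines of Python. This is only an illustration: the `search_log` queries, the `candidate_terms` helper, and the `min_count` threshold are all hypothetical, and a real analysis would also need stemming and stop-word handling.

```python
from collections import Counter

# Hypothetical search log: one raw user query per entry.
search_log = [
    "cancer",
    "Cancer treatment",
    "oncology",
    "cancer",
    "tumor",
    "cancer screening",
]


def candidate_terms(log, min_count=2):
    """Count the words users actually type and keep the frequent ones."""
    counts = Counter(word for query in log for word in query.lower().split())
    return [(word, n) for word, n in counts.most_common() if n >= min_count]


print(candidate_terms(search_log))  # [('cancer', 4)]
```

In this made-up log, "cancer" would be a candidate on user warrant, while "oncology" would need literary warrant to justify its inclusion.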

Form Analysis

– Linguistics
  • Syntax/Form (grammar)
  • Morphology (internal word structure)
  • Semantics (meaning)
  • Pragmatics, discourse analysis (word/phrase use)

– Semiotics
  • Study of signs/symbols

– Lexical structure
  • Document layout, markup, tags (think DOM)

Indexing & Retrieval

• Pre/Post-Coordinate
  – Pre-coordinate: organization prior to retrieval
  – Post-coordinate: organization at the point of retrieval

• Recall / Precision
  – Recall: number of retrieved relevant docs / total number of relevant docs in collection
  – Precision: number of retrieved relevant docs / total number of retrieved docs

• Natural language processing
  – Uses semantics and syntax to automatically distill ‘aboutness’

Recall & Precision

• A collection of 100 documents

CV Entry                   # of docs
Controlled Vocabularies    100
Faceted analysis           20
Ontologies                 5
OWL                        1
RDF                        3

• Searches
  – “Vocabularies”
    • Recall 100/100 = 1
    • Precision 100/100 = 1
  – “Facet”
    • Recall 20/100 = .2
    • Precision 20/28 = .71
  – “OWL”
    • Recall 1/100 = .01
    • Precision 1/1 = 1

Recall = # of relevant docs retrieved / total # of relevant docs in collection

Precision = # of relevant docs retrieved / total # of docs retrieved

The recall figures above divide by the full collection of 100 documents; that matches the definition of recall only when every document in the collection is relevant to the search.
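As a minimal sketch of the standard definitions, the following Python computes recall and precision from sets of document IDs. The document IDs and the scenario (20 relevant documents, 28 retrieved) are assumptions chosen to reproduce the 20/28 precision figure from the "Facet" search; they are not data from the lecture.

```python
def recall(retrieved: set, relevant: set) -> float:
    """Fraction of the relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved: set, relevant: set) -> float:
    """Fraction of the retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


# Hypothetical scenario: documents 0-19 are relevant to the query;
# the search returns those 20 plus 8 non-relevant documents (20-27).
relevant_docs = set(range(20))
retrieved_docs = set(range(28))

facet_recall = recall(retrieved_docs, relevant_docs)        # 20 / 20
facet_precision = precision(retrieved_docs, relevant_docs)  # 20 / 28
print(f"recall={facet_recall:.2f} precision={facet_precision:.2f}")
```

Under these assumptions every relevant document is found (recall 1.00), but eight irrelevant hits pull precision down to about 0.71.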

Types of Controlled Vocabularies

• Term Lists
  – Glossaries, Dictionaries, Gazetteers, Folksonomies

• Synonym rings
  – Z39.19 example
  – Oracle Text

• Taxonomies
  – Website navigation scheme

• Thesauri / Ontologies
  – Authority files, subject thesauri, topic maps

http://www.taxotips.com/

Thesauri & taxonomy examples

• List of vocabularies
  – http://www.slais.ubc.ca/resources/indexing/database1.htm
  – Taxonomy warehouse

• Two Examples
  – Health & Ageing Thesaurus
  – Thesaurus of Geographic Names

CV Structures

• Organization structures
  – Hierarchical systems
    • Term Lists / Enumerative systems
    • Hierarchies
    • Trees
  – Facets / Associative relationships
  – Folksonomies

Hierarchies

• Features
  – Inclusiveness
  – “Is-a” relationship
  – Inheritance
  – Transitivity
  – Systematic
  – Mutually exclusive
  – Necessary and sufficient
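The “is-a” relationship and transitivity can be illustrated with a small Python sketch. The terms and the `BROADER` mapping are made-up examples, not drawn from any real vocabulary; giving each term exactly one broader term is a simplifying assumption that mirrors the mutually exclusive feature.

```python
# Hypothetical hierarchy: each term maps to its single broader ("is-a") term.
BROADER = {
    "poodle": "dog",
    "dog": "mammal",
    "mammal": "animal",
}


def ancestors(term: str) -> list:
    """Walk up the hierarchy, collecting every broader term in order."""
    chain = []
    while term in BROADER:
        term = BROADER[term]
        chain.append(term)
    return chain


def is_a(narrower: str, broader: str) -> bool:
    """Transitivity: a poodle is a dog and a dog is a mammal, so a poodle is a mammal."""
    return broader in ancestors(narrower)


print(ancestors("poodle"))       # ['dog', 'mammal', 'animal']
print(is_a("poodle", "animal"))  # True
print(is_a("animal", "poodle"))  # False
```

Inheritance follows the same chain: anything asserted about "mammal" applies to every term whose ancestors include it.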

From http://bit.ly/lbsc_670_cv

Relationships

• Equivalence (Term Lists)
  – “use”, “see”, “isVersionOf”, “isFormatOf”

• Hierarchical (Thesauri, Taxonomies)
  – Generic – “is a”
  – Partitive – “is part of”, “has part”, “has conceptual part”, “member of”
  – Instance –

• Associative (Facets, Ontologies)
  – “isReferencedBy”, “isRequiredBy”, “hasDerivative”

Faceted vocabularies

Multi-dimensional, multi-relationship driven, Subject, Object, Predicate

From http://bit.ly/lbsc_670_cv

Folksonomy

• Features
  – Single level description
  – Open vocabulary list
  – User supplied/harvested tags

http://trendistic.indextank.com/

Term List Examples

• Authority files – Maps to preferred terms
  – Library of Congress
  – Encoded Archival Context
  – Union List of Artist Names

• Glossaries/Dictionaries – Words & definitions, sometimes topic focused
  – Glosso-Thesaurus

• Folksonomies
  – Contextualization, Trend discovery, Personal Information

• Synonym rings – Used for back-end equivalence in searching
  – Princeton WordNet
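To make "back-end equivalence in searching" concrete, here is a rough Python sketch of a synonym ring used for query expansion. The rings, the documents, and the `expand_query` and `search` helpers are all hypothetical; this is not how WordNet or any particular search engine implements it.

```python
# Hypothetical synonym rings: every term in a ring is treated as equivalent.
SYNONYM_RINGS = [
    {"car", "automobile", "auto"},
    {"film", "movie", "motion picture"},
]

# Hypothetical documents, each represented by the set of index terms it carries.
DOCUMENTS = {
    "doc1": {"automobile", "repair"},
    "doc2": {"movie", "review"},
    "doc3": {"car", "rental"},
}


def expand_query(term: str) -> set:
    """Return the term plus all members of any ring that contains it."""
    expanded = {term}
    for ring in SYNONYM_RINGS:
        if term in ring:
            expanded |= ring
    return expanded


def search(term: str) -> list:
    """Retrieve documents carrying the term or any of its ring members."""
    terms = expand_query(term)
    return sorted(doc_id for doc_id, words in DOCUMENTS.items() if words & terms)


print(search("car"))  # ['doc1', 'doc3']
```

Unlike a thesaurus, the ring has no preferred term: the user never sees the expansion, and it tends to raise recall at some cost to precision.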

Choosing a framework

• Use questions
  – Who is your user, what are their needs?
  – What systems are your users familiar with?
  – Will this system be internal/external?

• Content questions
  – How extensive, defined is the information?
  – Is your subject matter static or fluid?
  – What organizational framework best describes your content?

• System questions
  – What access are you trying to provide?
  – What external pressures exist?
  – What external entities/theories will interact with this system?

Thesauri Definitions

– “Guide to use of terms, showing relationships between them, for the purpose of providing standardized, controlled vocabulary for information storage and retrieval”(Monash)

– “A list of words showing similarities, differences, dependencies, and other relationships to each other”(USG)

Creating a CV (1)

• Design methods
  – Re-use existing, start with content & desired use ideas
  – Committee / community approach

• Top-down
  – Concept driven

• Bottom-up
  – Document driven
  – Empirical approach

• Deductive approach
  – Select terms, create relationships, perform term control

• Inductive approach
  – Establish CV at outset, build hierarchies on as needed basis

Creating a CV (2)

• Top-Down (deductive)
  – Identify audience
  – Identify all topics, concepts, uses, and context of the domain
  – Sort topics identified into an appropriate organization scheme (enumerative, hierarchical, faceted)
  – Solidify structure and clean up gaps & redundancies
  – Assign documents to categories, test retrieval

• Bottom-up (inductive)
  – Identify audience
  – Survey documents for topics/concepts
  – Build system on the fly – let content drive structure and limits of system
  – Identify gaps & redundancies in system
  – Test retrieval

Creating a CV (3)

• Think about scope, use, content, maintenance

• Gather Terms
  – Based on existing systems, content
  – Based on user needs/expectations
  – Investigate issues of specificity, exhaustivity, granularity

• Build hierarchies, relationships
  – Broader/narrower terms, Related terms, Use/Use for, see/see also

• Establish Rules

• Implement

• Evaluate

• Maintain

http://www.boxesandarrows.com/view/creating_a_controlled_vocabulary

Evaluating a CV

• Goals
  – Determine if the CV solves retrieval needs of user/system
  – Determine if CV matches user’s content model/term expectations

• Methods
  – Expert evaluation of CV
  – User based card sorting compared to actual CV
  – Identification of non-included documents
  – Analysis of use of system - HCI

CV Maintenance

• Primary responsibility
  – Editor, board, committee

• New terms
  – Is it really new or a different view?
  – What is the proper form & placement?

• Modified terms
  – Include a change log
  – Use a “USE” reference to point to new term

• Deleted terms
  – Unused / Overused terms
  – May want to keep for historical retrieval purposes

• Modification history
  – Use modification notes, date/time stamps
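One way to picture these maintenance practices is a small Python sketch in which a retired term is kept, pointed at its replacement with a USE reference, and stamped with a dated note. The `Term` class, the `replace_term` and `resolve` helpers, the labels, and the date are all hypothetical illustrations rather than any standard's data model.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class Term:
    label: str
    use: Optional[str] = None                         # USE pointer to the replacement term
    history: List[str] = field(default_factory=list)  # dated modification notes


def replace_term(vocab: Dict[str, Term], old: str, new: str, when: date) -> None:
    """Retire `old` in favor of `new`, keeping `old` for historical retrieval."""
    stamp = when.isoformat()
    if new not in vocab:
        vocab[new] = Term(new, history=[f"{stamp}: added, replaces '{old}'"])
    vocab[old].use = new
    vocab[old].history.append(f"{stamp}: deprecated, USE '{new}'")


def resolve(vocab: Dict[str, Term], label: str) -> str:
    """Follow USE references until a current term is reached."""
    seen = set()
    while vocab[label].use is not None and label not in seen:
        seen.add(label)
        label = vocab[label].use
    return label


# Illustrative labels and date only.
vocab = {"Cookery": Term("Cookery")}
replace_term(vocab, "Cookery", "Cooking", date(2011, 1, 1))

print(resolve(vocab, "Cookery"))  # Cooking
print(vocab["Cookery"].history)   # ["2011-01-01: deprecated, USE 'Cooking'"]
```

Because the old entry is never deleted, documents indexed under it can still be found, and the history list doubles as the change log.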

Thesauri Concepts

• Preferred terms

• Non-preferred terms

• Semantic relations between terms

• How to apply terms (guidelines, rules)

• Scope notes

• Adding terms (how to produce terms that are not listed explicitly in the thesaurus)

Common thesaural identifiers

• SN Scope Note – Instruction, e.g. don’t invert phrases

• USE Use (another term in preference to this one)

• UF Used For

• BT Broader Term

• NT Narrower Term

• RT Related Term
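These identifiers come in reciprocal pairs (BT/NT, USE/UF, and RT with itself), which a short Python sketch can make explicit. The `Thesaurus` class, its methods, and the sample terms are hypothetical; this is a simplified illustration, not the data model of any particular thesaurus tool.

```python
from collections import defaultdict

# Reciprocal of each relationship tag: BT <-> NT, USE <-> UF, RT <-> RT.
RECIPROCAL = {"BT": "NT", "NT": "BT", "USE": "UF", "UF": "USE", "RT": "RT"}


class Thesaurus:
    def __init__(self):
        # term -> tag -> set of target terms
        self.entries = defaultdict(lambda: defaultdict(set))

    def relate(self, term: str, tag: str, target: str) -> None:
        """Record a relationship and its reciprocal so entries stay consistent."""
        self.entries[term][tag].add(target)
        self.entries[target][RECIPROCAL[tag]].add(term)

    def display(self, term: str) -> str:
        """Render an entry as the term followed by its tagged references."""
        lines = [term]
        for tag in ("USE", "UF", "BT", "NT", "RT"):
            for target in sorted(self.entries[term][tag]):
                lines.append(f"  {tag} {target}")
        return "\n".join(lines)


t = Thesaurus()
t.relate("Dogs", "BT", "Mammals")
t.relate("Dogs", "NT", "Poodles")
t.relate("Dogs", "UF", "Canines")
t.relate("Dogs", "RT", "Pets")

print(t.display("Dogs"))
print(t.display("Canines"))
```

Recording both directions at once means the non-preferred entry "Canines" automatically carries "USE Dogs", and "Mammals" automatically lists "Dogs" as a narrower term.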

Thesauri Guides

• National Information Standards Organization. (2005). Guidelines for the construction, format, and management of monolingual controlled vocabularies. ANSI/NISO Z39.19-2005. Bethesda, MD: NISO Press.
  – http://www.niso.org/standards/resources/Z39-19-2005.pdf?CFID=5559601&CFTOKEN=31747314

• Aitchison, Jean & Gilchrist, Alan. Thesaurus Construction: A Practical Guide. 3rd ed. London: Aslib, 1997.

• Willpower Information Management Consultants
  – http://www.willpower.demon.co.uk/thesprin.htm

Thesaurus Exploration

• http://www.getty.edu/research/tools/vocabularies/tgn/

• Protégé introduction and tour
  – What is Protégé?
  – What is it used for?
  – How will we use it this semester?

Webster’s Dictionary

• Webster’s Third New International Dictionary defines Ontology as:
  1. A science or study of being, specifically a branch of metaphysics* relating to the nature and relations of being.
  2. A theory concerning the kinds of entities and specifically the kinds of abstract entities that are to be admitted to a language system.

*Metaphysics: Nature of being “or” existence.

Next Week

• Work time for Protégé

• Exploration of ontologies