JSTOR Sustainability Collection - DHUG 2015
JSTOR Sustainability Collection
Sharon Garewal, JSTOR Senior Metadata Librarian
Ron Snyder, ITHAKA Labs Director of Research and Development
Overview
Sustainability collection defined
Utilization of the thesaurus within the sustainability collection
Subject matter experts enlisted
Results
Live demo
JSTOR: a quick primer
3,200+ journals & 30,000+ books
9.3 million full-length articles
70 million pages
2.9 million book reviews
138 million content accesses in 2013
100 million searches per year
http://www.jstor.org/
Sustainability Collection: what will it be?
Driver: Sustainability is an emerging interdisciplinary area whose research and teaching needs JSTOR wanted to support.
Core topics: Cities and Urbanization; Food and Agriculture; Industrial Ecology; Resource Economics; Forestry and Land Use; Environmental Policy and Law
Composed of journals, books, and grey literature (working reports, research reports, technical reports, etc.)
Specialized functionality to support research, including semantic indexing that helps researchers locate related terms and concepts. This is where the JSTOR Thesaurus (JTHES) comes into play!
JTHES: 19 top terms; 57,470 terms; 103,129 rules
The challenge
To assemble a list of key terms in Sustainability
The terms will be used to organize and tag sustainability-related research articles on JSTOR starting in 2015.
These terms will also be used for an autocomplete function in the search component.
Utilize the JTHES in a live prototype
This was the first project where we looked at how to use the thesaurus as an intelligence layer within a collection. How should it work? How do we do this?
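As a rough illustration of the autocomplete idea, the following is a minimal sketch of case-insensitive prefix matching over a list of terms. The function names (`build_index`, `suggest`) and the sample terms are hypothetical; the slides do not describe how the search component was actually implemented.

```python
import bisect


def build_index(terms):
    """Return the terms sorted case-insensitively (hypothetical helper)."""
    return sorted(terms, key=str.lower)


def suggest(index, prefix, limit=5):
    """Return up to `limit` terms from a sorted index that start with `prefix`."""
    keys = [t.lower() for t in index]
    p = prefix.lower()
    # Binary search for the first term that could match the prefix.
    start = bisect.bisect_left(keys, p)
    matches = []
    for term in index[start:]:
        if len(matches) == limit or not term.lower().startswith(p):
            break
        matches.append(term)
    return matches


# Illustrative terms only, not taken from JTHES.
index = build_index(["Urban planning", "Urban sprawl", "Forestry", "urban agriculture"])
print(suggest(index, "urb"))
```

A sorted list with binary search is enough for a vocabulary of tens of thousands of terms; a production system would more likely rely on the search engine's own suggestion feature.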
How do we get this done? The options…
Create a new thesaurus for sustainability:
Pros: Specific to sustainability
Cons: Changes must be made in more than one place; cost of creating and maintaining a separate thesaurus
Create a sustainability branch within JTHES:
Pros: Could link all relevant branches and terms from elsewhere in JTHES into one branch through BT (broader term) relationships
Cons: Redundant; multiple BTs clutter up the JTHES
Create a facet to tag terms within JTHES as “Sustainability”:
Pros: Creates a flat list (in faceted view) of all of the terms in that facet; Easy to maintain
Cons: Does not show a hierarchy; Cannot have multiple facets
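To make the facet option concrete, here is a minimal sketch of how a facet tag yields a flat, alphabetical list without touching the hierarchy. The `Term` class, its fields, and the sample terms are assumptions for illustration; they do not reflect the actual JTHES data model. A single `facet` field is one reading of "Cannot have multiple facets".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Term:
    label: str
    broader: Optional[str] = None  # BT link; the hierarchy is left unchanged
    facet: Optional[str] = None    # assumed: at most one facet per term


def facet_view(terms, facet):
    """Flat, alphabetical list of every term tagged with the given facet."""
    return sorted((t.label for t in terms if t.facet == facet), key=str.lower)


# Illustrative terms only.
terms = [
    Term("Urban planning", broader="Urban studies", facet="Sustainability"),
    Term("Sonnets", broader="Poetry"),
    Term("Composting", broader="Waste management", facet="Sustainability"),
]
print(facet_view(terms, "Sustainability"))
```

The view ignores `broader` entirely, which matches the trade-off listed above: easy to maintain, but no hierarchy in the faceted list.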
The road to sustainability…
Research: examined existing glossaries and thesauri created by research libraries, discipline associations and individual scholars in each of the disciplines.
Existing terms (pulling lists)
Existing branches (clean up)
Adding new terms
Adding new branches: Food studies, Urban studies, etc.
Constructing new rules and refining existing rules
Testing content
Enlisting Subject matter experts
Contacted faculty members in ten disciplines to review the subset of terms assembled for their discipline, with an eye toward:
Is this how people in the field express this concept?
Is it correctly included in the sustainability facet?
Are there any important terms or concepts that we've missed? (including acronyms, synonyms, variant spellings, inverted phrases)
SME spreadsheets
Each SME approached their subject area slightly differently: some were reluctant to give much feedback, while others gave large amounts of feedback to sift through.
Example of terms pulled from Law, Public administration/policy and International/global studies
View: Facet provides alphabetical list of all tagged terms
The Results!! labs.jstor.org/sustainability
Implementation of the Sustainability Prototype
The thesaurus and semantic index are used for content discovery and presentation
The identification of a “sustainability collection” from the JSTOR corpus was performed using topic modeling (specifically LDA – Latent Dirichlet Allocation)
A model of 100 topics was generated from the content
Staff assigned sustainability scores for each of the topics based on a review of the top words in each topic
Each document in the JSTOR corpus was then assigned a sustainability score of 0-9 based on the sustainability scores for the topics most closely associated with the document
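The slides do not give the exact formula for the document score, so the following is only a sketch under stated assumptions: the score is a weighted average of the staff-assigned topic scores over the document's strongest topics, rounded and clamped to 0-9. The function name, the `top_k` parameter, and the sample numbers are hypothetical.

```python
def document_score(topic_weights, topic_scores, top_k=3):
    """Sketch of a 0-9 document score.

    topic_weights: {topic_id: weight of that topic in this document} (e.g. from LDA)
    topic_scores:  {topic_id: staff-assigned sustainability score, 0-9}
    top_k:         how many of the document's strongest topics to use (assumed)
    """
    top = sorted(topic_weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(weight for _, weight in top)
    if total == 0:
        return 0
    average = sum(weight * topic_scores[topic] for topic, weight in top) / total
    # Round to an integer and clamp to the 0-9 range used in the prototype.
    return max(0, min(9, round(average)))


# Illustrative values only.
weights = {0: 0.6, 1: 0.3, 2: 0.1}
scores = {0: 9, 1: 3, 2: 0}
print(document_score(weights, scores))
```

Restricting the average to the top few topics keeps a long tail of weakly associated topics from diluting the score, which is one plausible reading of "the topics most closely associated with the document".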
Weighting of document-level indexed terms
Document-level weights were computed for each semantic term using TF-IDF
TF-IDF is a measure of how important a word is to a document in a collection
The TF-IDF value increases proportionally to the number of times the word appears in a document (the ‘TF’ or term frequency), but is offset by how common the word is in a corpus (the ‘IDF’ or inverse document frequency)
The TF-IDF weighted terms are used to:
order the terms displayed for each document
boost document relevancy when index terms are used in discovery
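For readers unfamiliar with TF-IDF, here is a minimal sketch using one common variant: raw term count multiplied by the natural log of (number of documents / number of documents containing the term). The slides do not say which variant JSTOR used, and the function names and sample data are assumptions.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """docs: {doc_id: list of semantic terms indexed for that document}.

    Returns {doc_id: {term: tf-idf weight}}.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))
    weights = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        weights[doc_id] = {
            term: count * math.log(n_docs / df[term]) for term, count in tf.items()
        }
    return weights


def ranked_terms(weights, doc_id):
    """Terms for one document, highest weight first (the display order)."""
    doc = weights[doc_id]
    return sorted(doc, key=lambda term: doc[term], reverse=True)


# Illustrative data only.
docs = {
    "a": ["forestry", "forestry", "policy"],
    "b": ["policy", "urban planning"],
}
w = tfidf_weights(docs)
print(ranked_terms(w, "a"))
```

In this toy corpus "policy" appears in every document, so its IDF is log(1) = 0 and it ranks below "forestry", which is the offsetting effect described above.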
Auto-suggest and refining results
[Thesaurus slide: a new thing, metadata we create, screenshot(s) of Sustainability Portal]
Refinements in our use of the thesaurus and semantic index in sustainability
Auto calculation of sustainability score using LDA topics and thesaurus sustainability facet
Calculate topics and term correlations
Compute sustainability score for each topic based on the most relevant terms and sustainability facet
Compute a sustainability score for each corpus document based on topic weights and topic sustainability score
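The topic-level step could be sketched as follows, assuming the topic score is the correlation-weighted share of a topic's most relevant terms that carry the sustainability facet, scaled to 0-9. The actual formula is not stated in the slides; the function name, `top_n`, and the sample values are hypothetical, and correlations are assumed to be non-negative.

```python
def topic_score(term_correlations, facet_terms, top_n=10):
    """Sketch of an automated 0-9 sustainability score for one topic.

    term_correlations: {thesaurus term: correlation with the topic}, non-negative
    facet_terms:       set of terms tagged with the sustainability facet
    top_n:             number of most relevant terms to consider (assumed)
    """
    top = sorted(term_correlations.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    total = sum(corr for _, corr in top)
    if total <= 0:
        return 0.0
    in_facet = sum(corr for term, corr in top if term in facet_terms)
    return 9.0 * in_facet / total


# Illustrative values only.
correlations = {"recycling": 0.5, "waste management": 0.25, "poetry": 0.25}
facet = {"recycling", "waste management"}
print(topic_score(correlations, facet))
```

This replaces the manual review of top words with a lookup against the facet, so the score follows automatically whenever the facet or the topic model changes.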
Automated LDA topic labeling
Labeling topics generated by unsupervised topic modeling is an ongoing challenge
We’re investigating the feasibility of using the same topic/term correlations used to compute sustainability scores to assign labels
The approach attempts to find the thesaurus term that best characterizes the most highly correlated terms for each topic
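One possible heuristic for this, offered only as a sketch: let each highly correlated term vote for itself and for its broader terms, with the vote decaying at each step up the hierarchy, and pick the term with the most votes as the label. The `broader` map, the `decay` factor, and the sample data are assumptions; the slides describe the work as still under investigation and do not specify a method. For simplicity each term is assumed to have at most one broader term.

```python
from collections import defaultdict


def label_topic(term_correlations, broader, top_n=10, decay=0.75):
    """Pick a thesaurus term to label a topic (hypothetical heuristic).

    term_correlations: {thesaurus term: correlation with the topic}
    broader:           {term: its broader term}; top terms are absent from the map
    decay:             factor applied to a vote at each step up the hierarchy
    """
    top = sorted(term_correlations.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    votes = defaultdict(float)
    for term, corr in top:
        weight, node, seen = corr, term, set()
        # Walk up the BT chain, guarding against cycles in the data.
        while node is not None and node not in seen:
            seen.add(node)
            votes[node] += weight
            weight *= decay
            node = broader.get(node)
    return max(votes, key=votes.get) if votes else None


# Illustrative hierarchy and correlations only.
broader = {
    "composting": "waste management",
    "recycling": "waste management",
    "waste management": "environmental management",
}
correlations = {"composting": 0.4, "recycling": 0.4, "landfills": 0.1}
print(label_topic(correlations, broader))
```

Here the shared parent collects decayed votes from both children and outscores either one, while the decay keeps very general top terms from winning by default.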
Other JSTOR Labs projects/tools using the thesaurus and semantic index
http://labs.jstor.org/jthes/
http://labs.jstor.org/snap/
http://labs.jstor.org/readings/
Thesaurus Visualization Tool
And some other JSTOR Labs projects
http://labs.jstor.org/reflowit/
http://labs.jstor.org/shakespeare/