ProQuest Taxonomy Boot Camp Presentation 2008

Paula R. McCoy Manager, Taxonomy Development ProQuest [email protected] Finding a Common Language: Bringing Complex and Disparate Vocabularies Together

description

Paula McCoy of ProQuest describes taxonomy management there before and after implementing Synaptica's taxonomy management software.

Transcript of ProQuest Taxonomy Boot Camp Presentation 2008

Page 1: ProQuest Taxonomy Boot Camp Presentation 2008

Paula R. McCoy, Manager, Taxonomy Development

[email protected]

Finding a Common Language: Bringing Complex and Disparate Vocabularies Together

Page 2: ProQuest Taxonomy Boot Camp Presentation 2008

Part of Cambridge Information Group & CSA

Headquartered in Ann Arbor, Michigan

Editorial offices in Louisville, Kentucky

Page 3: ProQuest Taxonomy Boot Camp Presentation 2008

Access to over 125 billion digital pages of content from magazine, trade, & scholarly publications, current & historical newspapers, original materials such as annual reports & civil war pamphlets, and daily wire feeds

Subscription-based ProQuest® online information service available in academic and public libraries

Page 4: ProQuest Taxonomy Boot Camp Presentation 2008

Louisville editors abstract & index 4,000+ periodicals & newspapers

ProQuest Controlled Vocabulary used to index subjects; Authority Files used to index company, geographic, personal, product names

CV applied to non-periodical & third-party content via mapping, to allow cross-searching of multiple DBs with one vocabulary

Page 5: ProQuest Taxonomy Boot Camp Presentation 2008

Description of ProQuest Controlled Vocabulary & Authority Files

Taxonomy Management -- Overview

Life Before Synaptica

Thesaurus Management System Purchase

Implementing Synaptica

Life With Synaptica

Topics of Discussion

Q&A

Page 6: ProQuest Taxonomy Boot Camp Presentation 2008

Created in 1970s for ABI/INFORM business database

Based on Library of Congress Subject Headings

Natural language, hierarchical vocabulary complying with ANSI/NISO Standard Z39.19 (Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies)

ProQuest Controlled Vocabulary

PQ CV

Page 7: ProQuest Taxonomy Boot Camp Presentation 2008

ProQuest Controlled Vocabulary

Thesaurus subjects:
Business, economics & trade – 4300 terms
Science, math & technology – 1600 terms
Medicine – 1150 terms
Humanities – 960 terms
Government & policy – 850 terms
Education – 400 terms

Merged with general reference vocabulary in 1980s

Major development effort in past 4 years to boost science, education & medical terms

PQ CV

Page 8: ProQuest Taxonomy Boot Camp Presentation 2008

ProQuest CV: Statistics

Preferred terms: 11,046

Non-preferred terms: 5631

Scope Notes: 3194 (29%)

Cross-references (Broader, Narrower, Related terms): 67,700

Terms added in 2007: 77

Terms added in 2008: 58+

PQ CV

Page 9: ProQuest Taxonomy Boot Camp Presentation 2008

Authority Files: Statistics

Corporate/Organization Names: 438,098 (names added in 2008: 5489)

Personal Names: 416,239 (names added in 2008: 1526)

Geographic (Location) Names: 34,331 (names added in 2008: 144)

Product Names: 38,210 (names added in 2008: 54)

PQ CV

Page 10: ProQuest Taxonomy Boot Camp Presentation 2008

The Taxonomy Manager’s Job

Add subject terms as dictated by new concepts & new content to index

Maintain hierarchies & Scope Notes

Load updated Thesaurus to ProQuest interface

Manage authority files to maintain standards & control file size

Taxonomy Management

Page 11: ProQuest Taxonomy Boot Camp Presentation 2008

The Taxonomy Manager’s Job

Taxonomy Management

OBJECTIVE: To ensure that indexers and searchers alike have access to a complete and accurate Thesaurus that they can use to maximize the discoverability of documents in ProQuest

Page 12: ProQuest Taxonomy Boot Camp Presentation 2008

Thesaurus on ProQuest®

Taxonomy Management

Page 13: ProQuest Taxonomy Boot Camp Presentation 2008

Sample Subject Term

Taxonomy Management

Chronic obstructive pulmonary disease
  SN: Any lung disease, such as chronic bronchitis or emphysema, causing obstruction of bronchial airflow
  UF COPD
  BT Disease
  BT Respiratory diseases
  NT Asthma
  NT Bronchitis
  NT Emphysema
  RT Airway management
  RT Lungs

Key to the entry:

First line: Preferred, or main term
SN: Scope note defining term and how it is used
UF: Non-preferred term: points to term used to index
BT: Terms broader in nature to main term: COPD is a disease, and specifically, a respiratory disease
NT: Terms narrower in nature to main term: these are chronic lung diseases
RT: Terms related to main term that might be used to narrow the search
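The sample entry above can be pictured as a small record. The following is a minimal sketch in Python, not ProQuest's or Synaptica's actual data model: the `Term` class, its field names, and the `resolve` helper are all hypothetical, chosen only to mirror the SN, UF, BT, NT, and RT labels on the slide.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Term:
    """One preferred term and its thesaurus fields (hypothetical layout)."""
    name: str                                           # preferred, or main term
    scope_note: str = ""                                # SN
    used_for: list[str] = field(default_factory=list)   # UF: non-preferred terms
    broader: list[str] = field(default_factory=list)    # BT
    narrower: list[str] = field(default_factory=list)   # NT
    related: list[str] = field(default_factory=list)    # RT


# The sample subject term from the slide, entered field by field.
copd = Term(
    name="Chronic obstructive pulmonary disease",
    scope_note=("Any lung disease, such as chronic bronchitis or emphysema, "
                "causing obstruction of bronchial airflow"),
    used_for=["COPD"],
    broader=["Disease", "Respiratory diseases"],
    narrower=["Asthma", "Bronchitis", "Emphysema"],
    related=["Airway management", "Lungs"],
)


def resolve(query: str, terms: list[Term]) -> str | None:
    """Map a preferred or non-preferred entry to the term used to index."""
    wanted = query.casefold()
    for term in terms:
        if any(wanted == n.casefold() for n in [term.name, *term.used_for]):
            return term.name
    return None


print(resolve("COPD", [copd]))
```

Looking up the non-preferred entry "COPD" returns the preferred term, which is the "points to term used to index" behavior described in the key.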

Page 14: ProQuest Taxonomy Boot Camp Presentation 2008

Before Synaptica

Managing terms meant:

Multiple files
Duplicate entries
Errors

= less than ideal thesaurus management

Page 15: ProQuest Taxonomy Boot Camp Presentation 2008

MS Word Document

Before Synaptica

Page 16: ProQuest Taxonomy Boot Camp Presentation 2008

Vocabulary Documents in Word

ProQuest controlled vocabulary

French-language controlled vocabulary

German-language controlled vocabulary

Spanish-language controlled vocabulary

Combined PQ-CBCA controlled vocabulary

Ethnic database vocabulary, English

Ethnic database vocabulary, Spanish

Before Synaptica

Page 17: ProQuest Taxonomy Boot Camp Presentation 2008

Foreign-Language Vocabularies

French German Spanish

Before Synaptica

Page 18: ProQuest Taxonomy Boot Camp Presentation 2008

Oracle Database Forms

Before Synaptica

Page 19: ProQuest Taxonomy Boot Camp Presentation 2008

Authority Files in Oracle

Class codes (related to subjects)

CORP names (391,665+ terms)

GEOG names (32,000+ terms)

PERS names (350,000+ terms)

PROD names (38,000+ terms)

NAIC codes (related to companies)

Before Synaptica

Page 20: ProQuest Taxonomy Boot Camp Presentation 2008

Adding New Terms

1. Enter full term hierarchy into new Word doc

2. Copy term into main Word-based vocabulary & enter reciprocal relationships

3. Enter term & relationships into Oracle

4. Review next-day report on Oracle activity

5. Send new term doc to editors via e-mail

6. Print new vocabulary (at least every two years)
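Step 2 is where much of the manual effort went: every relationship on the new term needed a mirror-image entry on the other term. As a rough illustration only, here is a Python sketch that derives those mirror entries. The pairing BT/NT, RT/RT, and UF/USE follows common thesaurus practice; the function name and the tuple layout are assumptions, not anything taken from the Word or Oracle files.

```python
# Each relationship type and the type its mirror entry must carry.
RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT", "UF": "USE"}


def reciprocal_entries(term, relationships):
    """Return (source, type, target) entries that mirror a new term's relationships."""
    return [(target, RECIPROCAL[rel], term) for rel, target in relationships]


# A few relationships from the sample subject term shown earlier.
new_term = "Chronic obstructive pulmonary disease"
relationships = [
    ("UF", "COPD"),
    ("BT", "Respiratory diseases"),
    ("NT", "Emphysema"),
    ("RT", "Lungs"),
]

for entry in reciprocal_entries(new_term, relationships):
    print(entry)
```

Each printed tuple is an entry that, in the process above, had to be typed by hand into the main vocabulary document and again into Oracle.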

Before Synaptica

Page 21: ProQuest Taxonomy Boot Camp Presentation 2008

UF, RT, NT, BT, SN, Class Code [whew!]

Page 22: ProQuest Taxonomy Boot Camp Presentation 2008

Thesaurus Management Systems: Buying Criteria

TMS Purchase

Synaptica

Up to 40 admin & 100 read-only users in multiple locations

Ability to load vocabs from multiple Word docs & Oracle authority files

Support for foreign-language vocabularies

Ability to add new vocabularies

Vendor onsite installation & training

Software upgrades & tech support

1. Ability to interact in real time with editorial system

2. Ability to accommodate authority files of 400,000+ names

Page 23: ProQuest Taxonomy Boot Camp Presentation 2008

Implementing Synaptica

Contract signed and work begun in August 2004

PQ sent to Synaptica all the Word & Oracle files for analysis

Decision points: how to load & structure data; how to handle “suspect” or erroneous relationships

Implementing Synaptica

Page 24: ProQuest Taxonomy Boot Camp Presentation 2008

Synaptica Data Analysis

Relationship Validation Tests:

Term Uniqueness
Use Violations
Self-Referencing Relationships
One Relationship per Term Pair
Relationship Unique
Circular References
Relationship Reciprocates

Exception Reports delivered to PQ; Errors fixed before production
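The slide names the tests but does not define them, so the following Python sketch is only one plausible reading of three of them: self-referencing relationships, relationship reciprocates, and circular references. The triple layout, the function names, and the sample data (including "Term A" and "Term B") are hypothetical, and the circularity check follows BT links only.

```python
from collections import defaultdict

RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT"}


def self_references(triples):
    """Relationships where a term points at itself."""
    return [t for t in triples if t[0] == t[2]]


def missing_reciprocals(triples):
    """Relationships whose mirror-image entry is absent."""
    present = set(triples)
    return [(s, r, t) for s, r, t in triples
            if r in RECIPROCAL and (t, RECIPROCAL[r], s) not in present]


def circular_references(triples):
    """Terms that can reach themselves by following BT links upward."""
    broader = defaultdict(set)
    for s, r, t in triples:
        if r == "BT":
            broader[s].add(t)

    def reaches_itself(start):
        stack, seen = list(broader[start]), set()
        while stack:
            node = stack.pop()
            if node == start:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(broader.get(node, ()))
        return False

    return sorted(term for term in list(broader) if reaches_itself(term))


# Made-up sample data: (source term, relationship type, target term).
triples = [
    ("Asthma", "BT", "Respiratory diseases"),
    ("Respiratory diseases", "NT", "Asthma"),
    ("Bronchitis", "BT", "Respiratory diseases"),  # reciprocal NT is missing
    ("Lungs", "RT", "Lungs"),                      # points at itself
    ("Term A", "BT", "Term B"), ("Term B", "NT", "Term A"),
    ("Term B", "BT", "Term A"), ("Term A", "NT", "Term B"),  # BT cycle
]

print(self_references(triples))
print(missing_reciprocals(triples))
print(circular_references(triples))
```

The three lists play the role of an exception report: each entry is something an editor would need to fix before the data went into production.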

Implementing Synaptica

Page 25: ProQuest Taxonomy Boot Camp Presentation 2008

Use Validation Error

Example term: Marine resources

Implementing Synaptica

Page 26: ProQuest Taxonomy Boot Camp Presentation 2008

Foreign-Language Errors

Terms with no language equivalent (LEQ), i.e., no translation

In all 3 languages, multiple English terms with the same translation, e.g.:

English term         French term    French term-revised
Purchasing           Achats
Shopping             Achats         Shopping
Buyers               Acheteurs
Purchasing agents    Acheteurs      Agents d'achat

Implementing Synaptica
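Both kinds of foreign-language error can be found mechanically. The Python sketch below is illustrative only: the dictionary layout, the function names, and the use of `None` for a missing equivalent are assumptions, and the data is the French example from this slide.

```python
from collections import defaultdict


def missing_equivalents(equivalents):
    """English terms that have no language equivalent (LEQ)."""
    return sorted(en for en, leq in equivalents.items() if not leq)


def shared_translations(equivalents):
    """Translations that more than one English term points to."""
    by_translation = defaultdict(list)
    for english, translated in equivalents.items():
        if translated:
            by_translation[translated].append(english)
    return {t: sorted(e) for t, e in by_translation.items() if len(e) > 1}


# English term -> French term, as originally loaded.
french = {
    "Purchasing": "Achats",
    "Shopping": "Achats",
    "Buyers": "Acheteurs",
    "Purchasing agents": "Acheteurs",
}

# The same terms after the revisions shown on the slide.
french_revised = {
    "Purchasing": "Achats",
    "Shopping": "Shopping",
    "Buyers": "Acheteurs",
    "Purchasing agents": "Agents d'achat",
}

print(shared_translations(french))
print(shared_translations(french_revised))
```

Run against the original French column, the check flags "Achats" and "Acheteurs"; against the revised column it reports nothing.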

Page 27: ProQuest Taxonomy Boot Camp Presentation 2008

Final Challenge

Issue: Different editorial systems = 2x data entry: once for Synaptica, once for Oracle

Solution: Overnight synchronization process to copy Synaptica work into Oracle every night

Synch process discontinued April 2008

Implementing Synaptica

Page 28: ProQuest Taxonomy Boot Camp Presentation 2008

Putting Synaptica Into Production

Nov 2004

Deal with people resistant to change

Train users: provide documentation & hands-on demonstrative training

Encourage written feedback on system functionality

Send feedback to Synaptica; many of our suggestions were implemented in later versions

Implementing Synaptica

Page 29: ProQuest Taxonomy Boot Camp Presentation 2008

Life With Synaptica: Terms Management Made Easy!

Word – Old, Bad

Synaptica – New, Good

Life With Synaptica

Page 30: ProQuest Taxonomy Boot Camp Presentation 2008

Adding Terms Today: 3 Easy Steps

1. Enter term and relationships into Synaptica “Item Details” window

2. Export report of new terms into Word

3. Send Word document to editors

Life With Synaptica

Page 31: ProQuest Taxonomy Boot Camp Presentation 2008

Improving Thesaurus Management: Categories Feature

Life With Synaptica

Page 32: ProQuest Taxonomy Boot Camp Presentation 2008

Subject Term Categories

Life With Synaptica

Page 33: ProQuest Taxonomy Boot Camp Presentation 2008

CORP Names – Categories & Website

Life With Synaptica

Page 34: ProQuest Taxonomy Boot Camp Presentation 2008

Foreign-Language Vocabularies

Life With Synaptica

Language Equivalents

Page 35: ProQuest Taxonomy Boot Camp Presentation 2008

Foreign-Language Vocabularies

Life With Synaptica

Page 36: ProQuest Taxonomy Boot Camp Presentation 2008

Foreign-Language Vocabularies

Life With Synaptica

Alphabetical by language: French, German, Spanish

Page 37: ProQuest Taxonomy Boot Camp Presentation 2008

Synaptica Updates

Life With Synaptica

Synaptica version 6.0 released in early 2006

Synaptica version 7.0 is being implemented now:
• Enhanced user interface
• Semantic Web standardization (RDF, OWL, SKOS) and Web Services integration
• Expanded Reporting functionality
• Enhanced adding and editing of term relationships including “rapid-fire” simple drag-and-drop editing
• Improved global term editing
• Online help and user guides

Page 38: ProQuest Taxonomy Boot Camp Presentation 2008

Benefits of Synaptica

Greater awareness of thesaurus standards and terminology, e.g.: “preferred” and “non-preferred” instead of Use and Used For

Long-needed updating and improvement in term hierarchies; ability to provide thesaurus statistics

Increase in Company name NPTs — from 1935 to 8952 today

Immediate responsiveness to indexer needs — real-time term additions, esp. NPTs and SNs

Life With Synaptica

Easier loading of updated Thesaurus on PQ interface

Page 39: ProQuest Taxonomy Boot Camp Presentation 2008

Questions?

Thank you!