Centralized Taxonomy Management for Enterprise Information Systems

© Copyright 2007 Dow Jones and Company, Inc.

Enterprise Search Summit, Wednesday, September 24th, 2:00 pm – 2:30 pm

Dow Jones Client Solutions: Synaptica

ProQuest: Manager, Taxonomy Development

description

Daniela Barbosa, Synaptica Business Development Manager, Dow Jones Client Solutions, Dow Jones & Company

Paula R. McCoy, Manager, Taxonomy Development, ProQuest

Now that you have built your taxonomies, you need to manage and maintain them in a centralized environment that can be leveraged by all of your enterprise applications, including search tools, portals, and CMS/DMS systems. This session will review some best practices in centralized taxonomy management and go through the implementation of a thesaurus management tool at ProQuest, which enabled them to create a common language to connect disparate information assets using large and varied vocabularies and authority files linked to new and existing editorial systems.

Transcript of Centralized Taxonomy Management for Enterprise Information Systems

Page 1: Centralized Taxonomy Management for Enterprise Information Systems

© Copyright 2007 Dow Jones and Company, Inc.

Centralized Taxonomy Management for Enterprise Information Systems

Enterprise Search Summit Wednesday, September 24th, 2:00 pm – 2:30 pm

Dow Jones Client Solutions: Synaptica

ProQuest: Manager, Taxonomy Development

Page 2: Centralized Taxonomy Management for Enterprise Information Systems

Proprietary and Confidential | © Copyright 2007 Dow Jones and Company, Inc.

Dow Jones Taxonomy Solutions

Words
- Dow Jones taxonomy licensing
- Other taxonomy licensing (Taxonomy Warehouse)
- Taxonomy customization
- Taxonomy development

Expertise
- Taxonomy Assessment
- Taxonomy Consulting
- Analysis
- Recommendations
- Implementation
- Workshops

Tools
- Synaptica: Taxonomy / Metadata Management Tool

Page 3: Centralized Taxonomy Management for Enterprise Information Systems

Some Definitions

A taxonomy is a hierarchical topic structure to which information can be assigned through the dual processes of classification (filing to a location) and categorisation (tagging with relevant metadata). A taxonomy provides browsable navigation and supports filtered searching.

A thesaurus is a controlled vocabulary linking an organisation’s common language to its taxonomy structure. It accommodates synonyms, acronyms, language variants and other near equivalences. It also signposts non-hierarchical linkages within and across the taxonomy facets. A thesaurus is usually employed to interpret and guide user search queries.

An ontology is the working model of entities and interactions in a particular domain of knowledge or content set. It is a set of concepts, such as things, events, and relations, that are specified in some way in order to create an agreed-upon vocabulary for exchanging information. An ontology is increasingly used to visualise (or map) a set of search results and discover new or hidden connections.

Page 4: Centralized Taxonomy Management for Enterprise Information Systems

UP / DOWN: Classic taxonomy… groups things or concepts into families

SIDEWAYS: Traditional thesaurus… captures the different names of the family members and explores some more distant associations (cousins & close friends)

Multi-Directional: Emerging ontology… shows a network of multi-dimensional relationships and properties both within and outside the family groups

Page 5: Centralized Taxonomy Management for Enterprise Information Systems

UP / DOWN: Telephones is a broader term than Mobile Phones

SIDEWAYS: Mobile Phones are also known as Cell Phones & Hand Phones, and are similar to Hand Held Devices & PDAs

Multi-Directional: Mobile Phones are made by Phone Manufacturers and use the networks of Telecoms Service Providers
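The three kinds of relationship in the mobile phone example can be sketched as plain data structures. This is only an illustration: the variable names, the predicate names `are_made_by` and `use_networks_of`, and the `describe` helper are assumptions, not anything taken from Synaptica or the slides.

```python
# Taxonomy: hierarchical links (up / down).
BROADER = {"Mobile Phones": "Telephones"}

# Thesaurus: equivalence and association links (sideways).
SYNONYMS = {"Mobile Phones": ["Cell Phones", "Hand Phones"]}
RELATED = {"Mobile Phones": ["Hand Held Devices", "PDAs"]}

# Ontology: typed, multi-directional links stored as
# (subject, predicate, object) triples. Predicate names are hypothetical.
TRIPLES = [
    ("Mobile Phones", "are_made_by", "Phone Manufacturers"),
    ("Mobile Phones", "use_networks_of", "Telecoms Service Providers"),
]


def describe(term):
    """Return one sentence per relationship recorded for a term."""
    lines = []
    if term in BROADER:
        lines.append(f"{BROADER[term]} is a broader term than {term}")
    if term in SYNONYMS:
        lines.append(f"{term} is also known as {', '.join(SYNONYMS[term])}")
    if term in RELATED:
        lines.append(f"{term} is similar to {', '.join(RELATED[term])}")
    for subject, predicate, obj in TRIPLES:
        if subject == term:
            lines.append(f"{subject} {predicate} {obj}")
    return lines


for line in describe("Mobile Phones"):
    print(line)
```

The point of the sketch is that each step up the scale adds a richer kind of link: a single parent pointer, then untyped sideways links, then links that carry their own meaning.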

Page 6: Centralized Taxonomy Management for Enterprise Information Systems

Metadata’s Evolutionary Path

- Dictionaries & Flat Lists
- Hierarchical Taxonomies
- Controlled Vocabulary Thesauri
- Ontologies
- Structured Authority Files

Metadata is evolving organically: the less complex metadata elements form the building blocks for creating the more complex structures

Page 7: Centralized Taxonomy Management for Enterprise Information Systems

Practical Applications

- Portal navigation and browsable website menus
- Conceptual access to large databases
- Records management and cataloging
- e-Commerce online product catalogues
- Inventory control and de-duplication
- Auto-classification of internal documents and email
- Multilingual search and browse
- Metasearch of enterprise-wide resources

Page 8: Centralized Taxonomy Management for Enterprise Information Systems

Centralized Taxonomy and Metadata Management

Centralized Taxonomy Management System: Synaptica®

Multiple users working in collaborative and compartmentalized space (Permissions)

Portals; Categorizers; Search Engines; Content Portals

As a centralized repository for multi-lingual semantic management that is:

- Independent from systems like web-portal search and categorization systems
- Scalable; capable of evolving with emerging corporate semantic standards

HTML; CSV; XML; ZThes; SKOS; OWL; Web Services
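SKOS is one of the formats named on this slide. As a rough illustration of what a SKOS rendering of a term could look like, the sketch below writes a small vocabulary as Turtle using only string formatting. The `skos:` properties are part of the W3C SKOS vocabulary; the `ex:` namespace, the `slug` and `to_turtle` helpers, and the input dictionary layout are assumptions made for the example and say nothing about how Synaptica itself exports data.

```python
SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."
# Hypothetical base namespace for the concepts in this example.
BASE_PREFIX = "@prefix ex: <http://example.org/taxonomy/> ."


def slug(label):
    """Turn a label into a simple local name, e.g. 'Mobile Phones' -> 'mobile-phones'."""
    return label.lower().replace(" ", "-")


def to_turtle(terms):
    """Render a list of term dictionaries as SKOS concepts in Turtle.

    Labels are inserted as-is, so this sketch assumes they contain
    no double quotes or backslashes.
    """
    lines = [SKOS_PREFIX, BASE_PREFIX, ""]
    for term in terms:
        props = ["a skos:Concept", f'skos:prefLabel "{term["pref"]}"@en']
        props += [f'skos:altLabel "{alt}"@en' for alt in term.get("alt", [])]
        props += [f"skos:broader ex:{slug(b)}" for b in term.get("broader", [])]
        props += [f"skos:related ex:{slug(r)}" for r in term.get("related", [])]
        lines.append(f"ex:{slug(term['pref'])} " + " ;\n    ".join(props) + " .")
    return "\n".join(lines)


vocabulary = [
    {
        "pref": "Mobile Phones",
        "alt": ["Cell Phones", "Hand Phones"],
        "broader": ["Telephones"],
        "related": ["Hand Held Devices"],
    },
    {"pref": "Telephones"},
]

print(to_turtle(vocabulary))
```

A production export would normally go through an RDF library rather than string formatting; the hand-built version is used here only to keep the example self-contained.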

Page 9: Centralized Taxonomy Management for Enterprise Information Systems

Why Centralized?

Metadata can transcend information islands and data silos, but only if the enterprise is committed to common standards

A centralized system that supports both collaboration and compartmentalization allows common metadata to be shared while also allowing user communities the independence to manage specialized metadata files

Page 10: Centralized Taxonomy Management for Enterprise Information Systems

Why Independent?

Enterprises are increasingly making use of multiple proprietary and open source software tools for categorization, search and portal tasks

While many of these tools support some level of metadata management, the diversity of standards, data formats and business rules they support can actually exacerbate the data silo problem by creating metadata silos

Page 11: Centralized Taxonomy Management for Enterprise Information Systems

Where taxonomy fits with Search

DMS; CMS; Shared Docs; News & Research Data

Search Engine

Taxonomy & Metadata Platform

Information Processing, Management and Storage

Page 12: Centralized Taxonomy Management for Enterprise Information Systems

4 Good Reasons for Taxonomy

Search Relevancy

Search Completeness

Search Federation

Search Visualisation

Effective Research/Risk Mitigation

Knowledge Worker Productivity

Discovery & Innovation

Better & Faster Decisions

Page 13: Centralized Taxonomy Management for Enterprise Information Systems

1. Improved Search Relevancy

Ambiguity of Language: Is a Blackberry a fruit or a handheld device?

By including this brand name in a taxonomy we can give context to the user search query

In a telecoms domain we can assume that the user means the latter and only return content tagged as such

Alternatively we can weight the results, promoting those documents about handheld devices above those that refer to the fruit

Either way the result is increased search precision which translates into time savings
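The two options described on this slide, filtering by the assumed meaning or weighting it higher, can be sketched as follows. Everything here is illustrative: the `CONTEXT` mapping, the document records, the base scores and the `boost` value are made-up assumptions, not figures from the presentation.

```python
# Hypothetical mapping for a telecoms domain: the taxonomy tells us
# which category an ambiguous query term most likely refers to.
CONTEXT = {"blackberry": "Handheld Devices"}

# Made-up documents with taxonomy tags and a base relevance score.
DOCS = [
    {"title": "Blackberry picking season", "tags": {"Fruit"}, "score": 0.9},
    {"title": "Blackberry handset review", "tags": {"Handheld Devices"}, "score": 0.8},
]


def filter_by_context(query, docs):
    """Option 1: return only content tagged with the assumed category."""
    category = CONTEXT.get(query.lower())
    return [d for d in docs if category in d["tags"]]


def rerank_by_context(query, docs, boost=2.0):
    """Option 2: keep everything, but promote content tagged with the assumed category."""
    category = CONTEXT.get(query.lower())

    def weighted(doc):
        return doc["score"] * (boost if category in doc["tags"] else 1.0)

    return sorted(docs, key=weighted, reverse=True)


print([d["title"] for d in filter_by_context("Blackberry", DOCS)])
print([d["title"] for d in rerank_by_context("Blackberry", DOCS)])
```

Filtering gives the highest precision but hides the fruit articles entirely; re-ranking keeps them reachable further down the list.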

Page 14: Centralized Taxonomy Management for Enterprise Information Systems

2. Improved Search Completeness

Synonymous and Related Term Relationships

Mobile Phone (PT) = Cell Phone (NPT) = Hand Phone (NPT)

Mobile Phone is related to Hand Held Device (RT)

User Search Query = “Cell Phones”

The taxonomy simultaneously broadens the search and prioritises the returned results, giving increased recall without compromising relevancy

Content tagged with the Mobile Phone category is promoted over content not so tagged, using a weighting in the search algorithm

Content tagged with the Hand Held Device category may also receive a weighting
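A minimal sketch of this kind of query expansion is shown below, assuming the thesaurus is available as two lookup tables. The table names, the naive plural handling in `normalise`, and the weights 2.0 and 1.5 are all assumptions chosen for illustration; the slide does not specify how the weighting is implemented.

```python
# Entry terms (preferred and non-preferred) pointing to the preferred term (PT).
ENTRY_TERMS = {
    "mobile phone": "Mobile Phone",
    "cell phone": "Mobile Phone",   # NPT
    "hand phone": "Mobile Phone",   # NPT
}

# Related terms (RT) for each preferred term.
RELATED = {"Mobile Phone": ["Hand Held Device"]}


def normalise(query):
    """Lower-case the query and strip a single trailing 's' (a deliberately naive plural rule)."""
    key = query.strip().lower()
    return key[:-1] if key.endswith("s") else key


def expand(query, pt_weight=2.0, rt_weight=1.5):
    """Map a query to category weights: the preferred term first, related terms second."""
    preferred = ENTRY_TERMS.get(normalise(query))
    if preferred is None:
        return {}
    weights = {preferred: pt_weight}
    for related in RELATED.get(preferred, []):
        weights[related] = rt_weight
    return weights


def score(doc_tags, weights, base=1.0):
    """Multiply a base score by the largest weight among the document's tags."""
    return base * max((weights.get(tag, 1.0) for tag in doc_tags), default=1.0)


weights = expand("Cell Phones")
print(weights)
print(score({"Mobile Phone"}, weights), score({"Hand Held Device"}, weights), score({"Fruit"}, weights))
```

Untagged or unrelated content keeps its base score, so recall is widened without pushing the best matches down.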

Page 15: Centralized Taxonomy Management for Enterprise Information Systems

3. Search federation and data integration

A snapshot or dashboard is often more desirable than a list of document titles or snippets, especially when looking for information on a customer, supplier or competitor

Also, information will most likely reside in a number of internal repositories, each with their own levels of metadata structure

Taxonomy allows the combination of news, internal CI reports, price plans, coverage data, market share data, share price etc. in one consolidated view by providing mappings or cross-walks

This is essentially applying business intelligence discipline to the world of unstructured information
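One way to picture a cross-walk is as a table that maps each repository's local label onto a shared taxonomy concept, so that records from different stores can be grouped into one view. The sketch below assumes exactly that; the repository names, the company "Acme Telecom", the labels and the records are all invented for the example.

```python
from collections import defaultdict

# Hypothetical cross-walk: (repository, local label) -> shared taxonomy concept.
CROSSWALK = {
    ("news", "ACME Telecom Inc."): "Acme Telecom",
    ("ci_reports", "Acme"): "Acme Telecom",
    ("price_plans", "ACME-TEL"): "Acme Telecom",
}

# Made-up records, each tagged with its own repository's label.
RECORDS = [
    {"repo": "news", "label": "ACME Telecom Inc.", "item": "Launch announcement"},
    {"repo": "ci_reports", "label": "Acme", "item": "Competitor profile"},
    {"repo": "price_plans", "label": "ACME-TEL", "item": "Tariff sheet"},
    {"repo": "news", "label": "Unmapped Co.", "item": "Unrelated story"},
]


def consolidate(records):
    """Group records from all repositories under their shared taxonomy concept."""
    view = defaultdict(list)
    for record in records:
        concept = CROSSWALK.get((record["repo"], record["label"]))
        if concept is not None:  # records without a mapping are left out of the view
            view[concept].append((record["repo"], record["item"]))
    return dict(view)


for concept, items in consolidate(RECORDS).items():
    print(concept, items)
```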

Page 16: Centralized Taxonomy Management for Enterprise Information Systems

4. Search Visualisation

The previous three scenarios assume the user knows what they are looking for

But what about serendipitous discovery?

By being able to see across an aggregation of content and extract facts and relationships from deep within the information stores, true (and sometimes fortunate) discovery can take place

Page 17: Centralized Taxonomy Management for Enterprise Information Systems

Back End: Information Structure

Synaptica® Vocabulary & Metadata Management: Taxonomies; Thesauri; Ontologies

Document, Content & Records Management: Filing & Storage; Metadata Tagging (Categorisation) Process

Front End: Information Intelligence

Search Engine; Visualisation; Navigation; Intranet / Portal User Interface

Librarians; Taxonomists; Indexers; Knowledge & Information Managers

Information Creators; Records Managers; Content Managers; Librarians; Indexers

Information Users (the business; the public)

CIOs; CTOs; IT Architects

Page 18: Centralized Taxonomy Management for Enterprise Information Systems

Paula R. McCoy

Manager, Taxonomy Development

Centralized Taxonomy Management for Enterprise Information Systems

Page 19: Centralized Taxonomy Management for Enterprise Information Systems

Topics of Discussion

Description of ProQuest Controlled Vocabulary & Authority Files

Taxonomy Management -- Overview

Managing Terms Manually

Synaptica Thesaurus Management System

Page 20: Centralized Taxonomy Management for Enterprise Information Systems

Access to over 125 billion digital pages of content from magazine, trade, & scholarly publications, current & historical newspapers, original materials such as annual reports & civil war pamphlets, and daily wire feeds

Subscription-based ProQuest® online information service available in academic and public libraries

Page 21: Centralized Taxonomy Management for Enterprise Information Systems

ProQuest Controlled Vocabulary used to index subjects; Authority Files used to index company, geographic, personal, product names

CV applied to non-periodical & third-party content via mapping, to allow cross-searching of multiple DBs with one vocabulary

Page 22: Centralized Taxonomy Management for Enterprise Information Systems

ProQuest Controlled Vocabulary

Created in 1970s for ABI/INFORM business database

Based on Library of Congress Subject Headings

Natural language, hierarchical vocabulary complying with ANSI/NISO Standard Z39.19 (Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies)

Page 23: Centralized Taxonomy Management for Enterprise Information Systems

ProQuest Controlled Vocabulary

Thesaurus subjects:
- Business, economics & trade – 4300 terms
- Science, math & technology – 1600 terms
- Medicine – 1150 terms
- Humanities – 960 terms
- Government & policy – 850 terms
- Education – 400 terms

Merged with general reference vocabulary in 1980s

Major development effort in past 4 years to boost science, education & medical terms

Page 24: Centralized Taxonomy Management for Enterprise Information Systems

ProQuest CV: Statistics

Preferred terms: 11,046

Non-preferred terms: 5631

Scope Notes: 3194 (29%)

Cross-references (Broader, Narrower, Related terms): 67,700

Terms added in 2007: 77

Terms added in 2008: 58+

Page 25: Centralized Taxonomy Management for Enterprise Information Systems

Authority Files: Statistics

Corporate/Organization Names: 438,098 (names added in 2008: 5489)

Personal Names: 416,239 (names added in 2008: 1526)

Geographic (Location) Names: 34,331 (names added in 2008: 144)

Product Names: 38,210 (names added in 2008: 54)

Page 26: Centralized Taxonomy Management for Enterprise Information Systems

The Taxonomy Manager’s Job

Add subject terms as dictated by new concepts and new content to index

Maintain hierarchies & Scope Notes

Load updated Thesaurus to ProQuest interface

Manage authority files to maintain standards & control file size

Page 27: Centralized Taxonomy Management for Enterprise Information Systems

The Taxonomy Manager’s Job

OBJECTIVE: To ensure that indexers and searchers alike have access to a complete and accurate Thesaurus that they can use to maximize the discoverability of documents in ProQuest

Page 28: Centralized Taxonomy Management for Enterprise Information Systems

Sample Subject Term

Chronic obstructive pulmonary disease
  SN: Any lung disease, such as chronic bronchitis or emphysema, causing obstruction of bronchial airflow
  UF COPD
  BT Disease
  BT Respiratory diseases
  NT Asthma
  NT Bronchitis
  NT Emphysema
  RT Airway management
  RT Lungs

- Chronic obstructive pulmonary disease: Preferred, or main term
- SN: Scope note defining term and how it is used
- UF: Non-preferred term: points to term used to index
- BT: Terms broader in nature than the main term: COPD is a disease, and specifically, a respiratory disease
- NT: Terms narrower in nature than the main term: these are chronic lung diseases
- RT: Terms related to main term that might be used to narrow the search
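The sample term above can be held in a small record type that knows how to print itself in the SN / UF / BT / NT / RT layout. This is a sketch only: the `Term` class and its field names are assumptions for illustration and do not describe how ProQuest or Synaptica actually store terms.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One thesaurus entry: a preferred term plus its notes and relationships."""
    preferred: str
    scope_note: str = ""
    used_for: list = field(default_factory=list)   # UF: non-preferred terms
    broader: list = field(default_factory=list)    # BT
    narrower: list = field(default_factory=list)   # NT
    related: list = field(default_factory=list)    # RT

    def display(self):
        """Render the entry in the conventional tagged layout."""
        lines = [self.preferred]
        if self.scope_note:
            lines.append(f"  SN: {self.scope_note}")
        for tag, values in (
            ("UF", self.used_for),
            ("BT", self.broader),
            ("NT", self.narrower),
            ("RT", self.related),
        ):
            lines += [f"  {tag} {value}" for value in values]
        return "\n".join(lines)


copd = Term(
    preferred="Chronic obstructive pulmonary disease",
    scope_note=(
        "Any lung disease, such as chronic bronchitis or emphysema, "
        "causing obstruction of bronchial airflow"
    ),
    used_for=["COPD"],
    broader=["Disease", "Respiratory diseases"],
    narrower=["Asthma", "Bronchitis", "Emphysema"],
    related=["Airway management", "Lungs"],
)

print(copd.display())
```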

Page 29: Centralized Taxonomy Management for Enterprise Information Systems

Managing Terms Manually

New scientific content requiring a huge enhancement to vocabulary

Seven MS Word vocabulary documents, English and foreign language (French, German, Spanish), printed for internal use only

Six authority files & 3 vocabulary files in Oracle databases, requiring duplicate entry of subject terms in Word and Oracle

Legacy editorial system in process of being replaced

Page 30: Centralized Taxonomy Management for Enterprise Information Systems

Thesaurus Management Systems: Buying Criteria

Thesaurus Management System: Requirements

Eliminate double entry

Improve editorial interface with vocabulary

Automate entry of reciprocal relationships
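"Reciprocal relationships" means that entering one side of a link should create the other side automatically: a narrower term implies a broader term in the opposite direction, and so on. A minimal sketch of that idea follows; the `Thesaurus` class and its methods are hypothetical and are not Synaptica's interface.

```python
from collections import defaultdict

# Each relationship type and the type that must exist in the opposite direction.
RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT", "UF": "USE", "USE": "UF"}


class Thesaurus:
    """Stores term relationships and keeps both directions in sync."""

    def __init__(self):
        # (term, relationship type) -> set of target terms
        self.links = defaultdict(set)

    def add(self, term, rel, target):
        """Record term --rel--> target and the matching reverse link in one step."""
        self.links[(term, rel)].add(target)
        self.links[(target, RECIPROCAL[rel])].add(term)

    def get(self, term, rel):
        """Return the targets of one relationship type, sorted for stable output."""
        return sorted(self.links.get((term, rel), set()))


thesaurus = Thesaurus()
thesaurus.add("Chronic obstructive pulmonary disease", "NT", "Asthma")
thesaurus.add("Chronic obstructive pulmonary disease", "RT", "Lungs")
thesaurus.add("Chronic obstructive pulmonary disease", "UF", "COPD")

print(thesaurus.get("Asthma", "BT"))
print(thesaurus.get("Lungs", "RT"))
print(thesaurus.get("COPD", "USE"))
```

With this in place the editor enters each relationship once, which also addresses the double-entry requirement listed above.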

Page 31: Centralized Taxonomy Management for Enterprise Information Systems

Life With Synaptica

Word – Old, Bad

Synaptica – New, Good

Page 32: Centralized Taxonomy Management for Enterprise Information Systems

Adding Terms Today: 3 Easy Steps

1. Enter term and relationships into Synaptica “Item Details” window

2. Export report of new terms into Word

3. Send Word document to editors

Page 33: Centralized Taxonomy Management for Enterprise Information Systems

Improving Thesaurus Management: Categories Feature

Page 34: Centralized Taxonomy Management for Enterprise Information Systems

Subject Term Categories

Page 35: Centralized Taxonomy Management for Enterprise Information Systems

CORP Names – Categories & Website

Page 36: Centralized Taxonomy Management for Enterprise Information Systems

Foreign-Language Vocabularies

Language Equivalents

Page 37: Centralized Taxonomy Management for Enterprise Information Systems

Foreign-Language Vocabularies

Life With Synaptica

Spanish; German; French

Alphabetical by language

Page 38: Centralized Taxonomy Management for Enterprise Information Systems

Synaptica Updates

Synaptica version 6.0 released in early 2006

Synaptica version 7.0 is being implemented now:
• Enhanced user interface
• Semantic Web standardization (RDF, OWL, SKOS) and Web Services integration
• Expanded Reporting functionality
• Enhanced adding and editing of term relationships including “rapid-fire” simple drag-and-drop editing
• Improved global term editing
• Online help and user guides

Page 39: Centralized Taxonomy Management for Enterprise Information Systems

Benefits of Synaptica

Greater awareness of thesaurus standards and terminology, e.g.: “preferred” and “non-preferred” instead of Use and Used For

Long-needed updating and improvement in term hierarchies; ability to provide thesaurus statistics

Increase in Company name NPTs — from 1935 to 8952 today

Immediate responsiveness to indexer needs — real-time term additions, esp. NPTs and SNs

Easier loading of updated Thesaurus on PQ interface

Page 40: Centralized Taxonomy Management for Enterprise Information Systems

thank you!