The Challenges of Building Enterprise Content Taxonomies and the Role of Classification Technologies in Maintaining Their Effectiveness

Reginald J. Twigg, Ph.D. ([email protected])
Capture, Classification and Taxonomy, IBM ECM

© 2007 IBM Corporation

Information Management Software | Enterprise Content Management

Agenda

• The Challenge of Unstructured Content
• Key Concepts and Terms
• Taxonomy, Classification and ECM Adoption
• Classification Technologies for ECM

The Challenge of Managing Unstructured Content

80% of Enterprise Data is Unstructured

Databases

• Billing statements
• Claims images
• Customer correspondence
• Mortgage docs
• Contracts
• Signed BOLs
• Healthcare EOBs
• Marketing collateral
• Website content
• Voice authorizations
• Signature cards
• Credit enrollments
• Material Safety Data Sheets
• ISO 9000 docs
• Plant schematics
• Product images
• Spec sheets
• ...and much more!

What is Enterprise Content?

Where do I start?

Organizing the explosion of unstructured content becomes critical:

"We’ve got 600 GB of content from basic content services all over the enterprise. How can we get this content efficiently mapped into our ECM taxonomy?"

"We’ve been managing our content without classifying it for a few years now. How can our users navigate amongst this existing content in a way that’s intuitive for our business?"

"The lawyers have to review 400,000 electronic documents for their case. How can we make sure they don’t waste their time?"

Key Business Drivers

Business Value of Classification for ECM

In Process Classification: Increase worker productivity and automate content-related decisions
– Ad Hoc Category Suggestion
– Content-Based Workflow Selection
– Content-Based Decision Making

ECM Taxonomy and Classification: Increase accessibility of content under management
– Automated, High Scale Classification
– Classify at ingestion and/or re-classify over time
– Taxonomy Evolution Tools
– Enhanced Accessibility
– Taxonomy Proposer

Compliance, Records, Legal Discovery: Increase legal discovery review effectiveness while reducing risk
– Legal Discovery Prioritization and Workflow Assignment
– Records Classification and Exception Handling
– Storage and Retention Policy Assignment

Message Tagging, Classification and Monitoring: Reduce inquiry costs, automate message routing and increase customer satisfaction
– Email, Chat Routing
– Agent Response Suggestion
– Email Supervision and Monitoring
– Automatic Customer Response

Ability to Structure Content with Databases

[Figure: percent of corporate information value managed in traditional databases, plotted against data creation and demand for structured and unstructured data. Application types range from OLTP and BI (narrow scope) to compliance and competitive intelligence (wide scope). Source: Gartner]

Multiple Repositories Make Access Difficult

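[Figure: pie chart of the number of content repositories per organization. Categories: 1 repository, 2-5 repositories, 6-10 repositories, 10-15 repositories, more than 15 repositories, don't know. Segment values (not matched to categories here): 36%, 14%, 25%, 17%, 5%, 4%.]

Base: 81 North American decision-makers (multiple responses accepted)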

“The Future of Content in the Enterprise,” Connie Moore and Robert Markham

And Then There’s SharePoint, File Shares and . . .

Key Concepts and Terms

Key Concepts

Metadata: a means of describing, locating, cataloging, and activating content as objects in a software ecosystem (literally, data about data).

Enterprise Catalog: a centralized and normalized metadata model for unstructured content for the purposes of providing consistent services across all ECM applications.

Taxonomy: a hierarchical structure of information components, any part of which can be used to classify a content item in relation to other items in the structure.

Classification: a coding of content items as members of a group for the purposes of cataloging them or associating them with a taxonomy.
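The four definitions above can be pictured as a small data model. The following is a minimal sketch, not a product API: the class names, the `category` metadata key and the sample paths are all assumptions made for illustration. It shows a taxonomy as a hierarchy, metadata as a dictionary attached to a content item, and classification as the act of coding an item to a node in that hierarchy.

```python
from dataclasses import dataclass, field


@dataclass
class TaxonomyNode:
    """One node in a hierarchical taxonomy (hypothetical model)."""
    name: str
    children: dict = field(default_factory=dict)

    def add(self, path):
        """Create (or reuse) the nodes along a path such as ['Legal', 'Contracts']."""
        node = self
        for part in path:
            node = node.children.setdefault(part, TaxonomyNode(part))
        return node

    def contains(self, path):
        """Return True if every step of the path exists in the taxonomy."""
        node = self
        for part in path:
            if part not in node.children:
                return False
            node = node.children[part]
        return True


@dataclass
class ContentItem:
    """A content item plus its metadata (data about the data)."""
    item_id: str
    metadata: dict = field(default_factory=dict)


def classify(item, taxonomy, path):
    """Code an item as a member of a taxonomy node by recording the path in its metadata."""
    if not taxonomy.contains(path):
        raise ValueError(f"unknown category: {'/'.join(path)}")
    item.metadata["category"] = "/".join(path)
    return item


# Sample catalog; the category names are illustrative only.
root = TaxonomyNode("Enterprise Catalog")
root.add(["Finance", "Billing statements"])
root.add(["Legal", "Contracts"])

doc = ContentItem("doc-001", {"author": "jsmith"})
classify(doc, root, ["Legal", "Contracts"])
print(doc.metadata["category"])  # Legal/Contracts
```

Rejecting unknown paths in `classify` is one way to keep the catalog normalized: an item can only be coded to a category that the taxonomy already defines.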

Taxonomy Is . . .

Not turning animals into trophies

A system for organizing the corpus of business content

Taxonomy and Classification in ECM

Classification Examples:

– Document Classing

– Foldering

Taxonomy Examples:

– Enterprise Content Catalog

– Industry Standard Document Taxonomies (ISO, XMI)

Methods:

– Rules-Based: Applies pre-determined rules for ‘if, then’ classification of text and properties

– Analytics-Based: Applies algorithms to interpret classes in order to apply classification rules to them
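To make the contrast between the two methods concrete, here is a minimal sketch under stated assumptions. The rules, the sample training texts and the word-overlap scoring are all hypothetical; the scoring in particular is a deliberately simple stand-in for whatever algorithm a real analytics engine uses.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Rules-based: pre-determined 'if, then' rules over text and properties.
RULES = [
    (lambda doc: "invoice" in tokenize(doc["text"]), "Invoice"),
    (lambda doc: doc.get("mime") == "message/rfc822", "Email"),
]


def classify_by_rules(doc):
    """Return the class of the first rule that fires, or None."""
    for condition, label in RULES:
        if condition(doc):
            return label
    return None


# Analytics-based: learn a word profile per class from labeled examples.
def train(examples):
    """Build a word-count profile for each class from (text, label) pairs."""
    profiles = {}
    for text, label in examples:
        profiles.setdefault(label, Counter()).update(tokenize(text))
    return profiles


def classify_by_analytics(text, profiles):
    """Pick the class whose profile shares the most word occurrences with the text."""
    tokens = tokenize(text)
    scores = {label: sum(profile[t] for t in tokens) for label, profile in profiles.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


profiles = train([
    ("payment due for invoice number", "Invoice"),
    ("this agreement is between the parties", "Contract"),
])
print(classify_by_rules({"text": "Invoice 4471 attached", "mime": "application/pdf"}))  # Invoice
print(classify_by_analytics("agreement signed by both parties", profiles))  # Contract
```

The rules only fire on what their author anticipated, while the learned profiles generalize from examples. That difference is the reason the deck later argues that rules alone are not enough.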

ECM Taxonomy Illustrated

Taxonomy, Classification and ECM Adoption

Drive New Business Value from Content

Content Classification Solutions:

– Organize Unstructured Content

– Improve Content Access

– Derive Business Insight

Business Drivers for ECM Taxonomy Management

Proliferating departmental solutions

– Content Management

– Collaboration (SP, Quickr, Team Rooms, Wikis)

User-based classification and high workforce turnover

– Productivity declines as knowledge disappears

– Legal discovery is a secondary concern

Mergers and Acquisitions – need to reconcile disparate content management practices, repositories and processes

Classification is Hard Work

Key Business Challenges: ECM Taxonomy and Classification

Most organizations face content taxonomy pain – especially as they standardize around ECM

– Mapping content to taxonomy during ingestion

– Reclassifying content under management

– Evolving taxonomies as new types of content emerge

– Integrating folksonomies (SharePoint) into a master taxonomy

Increase accessibility of content under management:

– Automated, High Scale Classification

– Classify at ingestion and/or re-classify over time

– Taxonomy Evolution Tools

– Enhanced Accessibility

– Taxonomy Proposer

Organization is the Root Cause

Most organizations face content taxonomy barriers – especially as they standardize around ECM

– Assigning categories en masse

– Reclassifying existing content as taxonomies evolve

– Merging taxonomies

– Integrating the wisdom of folksonomies

Challenges and Impacts of Merging Taxonomies

Misclassification – change is constant, and master taxonomies must manage multiple custom taxonomies for each content source

“Folksonomies” from departmental collaboration solutions are created by users and unmanaged by ECM standards

Impact:

– Unreliable Metadata: Inconsistencies lose or mislabel content

– Process Misfires: Poor metadata triggers incorrect events and workflows

Scale is the Challenge – Automation is Essential

Lessons Learned From ERP Adoption

Getting Classification Right: ‘Garbage in = garbage out’ is often used in metadata management projects to describe the problem of building a metadata model on inconsistent sources.

Driving Process on Taxonomies: ERP systems depend on three master taxonomies – material, vendor and customer. These taxonomies drive events, workflow definition and the development of transaction-centric business process applications.

Mastering Metadata: The ability to deploy new enterprise applications depends upon the re-usability, scalability and integrity of the metadata model

System of Record is Required for Standardization:

– Establishes an enterprise standard that can be audited

– Forms the foundation for building demonstrable best practices

– Enforces consistency of data capture and output

Customer Lessons for Mastering ECM Taxonomies

‘Master’ taxonomy of record required for

– Compliance

– Business process applications

Merged master taxonomies become large and unwieldy

– Multiple taxonomies require integration and translation

– Centralized, decentralized, or hybrid?

Intelligent Classification increasingly is used to manage:

– Taxonomy merging from multiple use cases

– Taxonomy/folksonomy translation from distributed content sources
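As a rough illustration of taxonomy/folksonomy translation, the sketch below maps user-created tags from a departmental source onto master taxonomy paths and surfaces the tags it cannot translate. The mapping table, the tag values and the function names are hypothetical; a real deployment would more likely learn or curate such a mapping than hard-code it.

```python
def normalize(tag):
    """Fold case and separators so 'Purchase-Orders' and 'purchase orders' match."""
    return " ".join(tag.lower().replace("-", " ").replace("_", " ").split())


def translate_tags(tags, tag_map):
    """Map user tags to master categories; return (categories, unmapped tags)."""
    categories, unmapped = set(), []
    for tag in tags:
        category = tag_map.get(normalize(tag))
        if category is None:
            unmapped.append(tag)
        else:
            categories.add(category)
    return sorted(categories), unmapped


# Hypothetical mapping from departmental tags to master taxonomy paths.
TAG_MAP = {
    "po": "Finance/Purchase Orders",
    "purchase orders": "Finance/Purchase Orders",
    "nda": "Legal/Contracts",
}

categories, unmapped = translate_tags(["PO", "Purchase-Orders", "NDA", "q3 offsite"], TAG_MAP)
print(categories)  # ['Finance/Purchase Orders', 'Legal/Contracts']
print(unmapped)    # ['q3 offsite']
```

Returning the unmapped tags separately matters for taxonomy evolution: they are candidates for new categories rather than errors to discard.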

A Look at ECM Classification Technologies

State of Classification Management Technologies

ECM Classification/Taxonomy is an emerging discipline

– Industry standard taxonomies:

• Focus on business function or transaction types

• Have not reached the enterprise level

– Classification best practices:

• Content ingestion

• Application development reclassification

Classification software focuses on content ingestion:

– Electronic content (email, Office documents, free-form text)

– Paper content (document images) requires OCR

Search is not enough – must drive value in the business process

Criteria For ECM Classification Management Solutions

Integrate with and support the ECM metadata model

Interpret a highly-federated content ecosystem

Go beyond search to catalog and manage content

Build on advanced analytic technologies – rules alone are not enough

– Interpret content to extract meaningful (meta)data

– Employ multiple methods (engines) for classification

– Integrate teaching/learning

Common Platform for Electronic Content Classification

[Diagram: a common Classification Platform supporting four uses]

– Email Queue Classification and Monitoring

– In Process Classification

– ECM Taxonomy and Classification

– Compliance, Records, Legal Discovery

IBM Classification Module for Electronic Content

Organize your ECM content

Automated classification and filtering

Combines text analytics understanding with rules

Acquires domain specificity from your own content

Unique learning technology for adaptive classification

Suggests new categories or even seeds an entirely new taxonomy

Rectifies conflicting taxonomies

Market proven, scalable platform

Understanding Content with Text Analytics

[Diagram: a Classification Engine with Training (Teach), Matching, Feedback and Audit steps. It is trained on a categorized corpus and returns a list of categories with relevancies (scores) for new text.]

Categorized corpus (examples):

– A: IP is essential
– A: Legal is currently requiring full approval
– B: Engineering requires clear requirements
– C: The core market for this new product has been defined as such by IBM
– C: Strategy is important to the marketing team

Text to classify: The strategic value of this market is paramount to IBM

Categories list and relevancies (scores): C: 97%, B: 54%, A: 12%
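The train, match and feedback loop on this slide can be sketched with a toy engine. This is an assumption-laden stand-in, not the IBM Classification Module's algorithm: it scores categories by cosine similarity over plain word counts, and the class and method names are invented for the example. Its scores will not reproduce the percentages on the slide; only the shape of the output (a ranked list of categories with relevancies) is the same.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class ToyClassifier:
    """Bag-of-words stand-in for a classification engine (hypothetical)."""

    def __init__(self):
        self.profiles = defaultdict(Counter)

    def teach(self, text, category):
        """Training and feedback both add a labeled example to the corpus."""
        self.profiles[category].update(tokenize(text))

    def match(self, text):
        """Return (category, relevancy) pairs, highest cosine similarity first."""
        query = Counter(tokenize(text))
        q_norm = math.sqrt(sum(v * v for v in query.values()))
        results = []
        for category, profile in self.profiles.items():
            dot = sum(query[t] * profile[t] for t in query)
            p_norm = math.sqrt(sum(v * v for v in profile.values()))
            score = dot / (q_norm * p_norm) if q_norm and p_norm else 0.0
            results.append((category, score))
        return sorted(results, key=lambda pair: pair[1], reverse=True)


engine = ToyClassifier()
engine.teach("IP is essential", "A")
engine.teach("Legal is currently requiring full approval", "A")
engine.teach("Engineering requires clear requirements", "B")
engine.teach("The core market for this new product has been defined as such by IBM", "C")
engine.teach("Strategy is important to the marketing team", "C")

ranking = engine.match("The strategic value of this market is paramount to IBM")
for category, score in ranking:
    print(f"{category}: {score:.0%}")
```

With this toy scoring, category C ranks first because it shares the most words with the input. Calling `teach` again with a reviewer's correction is how the feedback step would adapt the engine over time.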

Classification Workflow: Accelerating Content Organization

[Diagram: content sources (File System, Basic Content Services, Existing Unclassified Managed Content) feed the Classifier, which filters out documents and automatically categorizes the majority of content. Other documents go to the Classification Review Tool or are sent to the Taxonomy Proposer.]

Reference: Integration Components

– Classifier (Runtime Application)

– Classification Review (UI)

– Taxonomy Proposer (UI)

– Content Extractor (training based on P8)

Components of the Solution for Text Classification

Classifier

– Automatically classifies and filters out documents

– Moves some documents for manual review

Classification Review Tool

– Allows user to manually review documents

Content Extractor

– Extracts content from the ECM system for training

Taxonomy Proposer

– User workflow to identify and name new categories or apply existing taxonomy from P8
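One plausible way the Classifier could decide between automatic classification, manual review and filtering is a pair of relevancy thresholds. The sketch below assumes exactly that: the threshold values, the action names and the reading of "filters out" as "discard low-relevancy documents" are illustrative assumptions, not documented product behavior.

```python
def route(scores, accept=0.8, reject=0.2):
    """Decide what to do with one document given {category: relevancy} scores.

    The thresholds are illustrative assumptions, not product defaults.
    """
    if not scores:
        return ("filter_out", None)
    category, best = max(scores.items(), key=lambda pair: pair[1])
    if best >= accept:
        return ("auto_classify", category)
    if best < reject:
        return ("filter_out", None)
    return ("manual_review", category)


print(route({"Contracts": 0.97, "Invoices": 0.12}))  # ('auto_classify', 'Contracts')
print(route({"Contracts": 0.54, "Invoices": 0.41}))  # ('manual_review', 'Contracts')
print(route({"Contracts": 0.05, "Invoices": 0.03}))  # ('filter_out', None)
```

Tuning `accept` trades reviewer workload against the risk of misclassification: raising it sends more documents to the Classification Review Tool.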

Classification for Paper Documents

Classification of paper documents occurs in capture process

Use cases for paper document classification

– Recognition using OCR/ICR

– Classification to associate to folders or doc class

– Separation to reduce costs and improve process

Three Primary Types of Images – The Document Recognition Problem

[Diagram: the three image types (Structured, Semi-Structured, Un-Structured) placed on a scale from Less Advanced to More Advanced.]

The Document Separation Problem in Image Capture

Separation of documents is a significant expense for a high-volume capture system

– Typical ‘structured’ recognition technologies are not applicable

– Manual insertion of separator sheets is the primary workaround today

– 50% of document preparation labor is spent sorting documents and inserting separator pages – source: TAWPI

Where does one document stop and the next begin?

Here? Here? Here? Here?

Classification Methods for Paper Content (Images)

Image Classification

– based on the overall layout and structure of a document

– Includes lines, boxes, logos and placement of text

Text Classification

– based on detailed analysis of the text content of a page

Rules-Based Classification

– performed by searching for specific data or keywords

– independent of layout

Templated Classification

– determined by the presence of one or more marks, barcodes or items of text in pre-defined locations

Waterfall Approach to Classification and Separation

Two-pass system:

1st pass: Classification

– optimizes performance by using fastest classification techniques first

– Advanced Text Classification final “catch-all”

[Figure: per-page results for pages 1 to 8 across four techniques (Barcode Recognition, Image Classification, Rules Based, Text Classification), with timings of 1 ms, 20 ms, 200 ms and 1000 ms. Pages are labeled as the first, middle or last page of Form X, Y or Z.]
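The first pass of the waterfall can be sketched as a chain of techniques tried in order of cost, stopping at the first one that produces a label. Everything below is a hypothetical stand-in: the four functions only imitate barcode, image, rules-based and text engines, the page dictionaries are invented, and the cheapest-first ordering is an assumption based on the slide's statement that the fastest techniques run first with text classification as the final catch-all.

```python
def waterfall_classify(page, techniques):
    """Try each technique in order; stop at the first that returns a label."""
    for name, technique in techniques:
        label = technique(page)
        if label is not None:
            return label, name
    return None, None


# Hypothetical stand-ins for real engines; each returns a label or None.
def barcode(page):
    return page.get("barcode")


def image_layout(page):
    return {"grid-with-logo": "Form X"}.get(page.get("layout"))


def rules(page):
    return "Form Y" if "policy number" in page.get("text", "").lower() else None


def text_analysis(page):
    return "Form Z" if page.get("text") else None


# Cheapest technique first; this ordering is an assumption.
TECHNIQUES = [
    ("barcode", barcode),
    ("image", image_layout),
    ("rules", rules),
    ("text", text_analysis),
]

print(waterfall_classify({"barcode": "Form X"}, TECHNIQUES))          # ('Form X', 'barcode')
print(waterfall_classify({"text": "Policy Number: 123"}, TECHNIQUES))  # ('Form Y', 'rules')
print(waterfall_classify({"text": "Dear customer"}, TECHNIQUES))       # ('Form Z', 'text')
```

Because most pages are resolved by an early, cheap technique, the expensive text analysis only runs on the pages nothing else could classify.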

Why Invest in Automated Classification?

Accelerate the time to value in your investment in ECM

Free up your subject matter experts

Ensure more accurate content catalogs

Make your content easier to find and leverage

Summary

1. Accelerate ECM Standardization
Poor content classification undermines ECM value – maximize your ECM potential and time-to-value with automated classification

2. Automating Classification Always Pays
Typical employees spend 10 hours/week searching for information – slash that time and increase productivity

3. Classification Technologies Automate Classification to Drive Development of Best Practices

IBM Classification Module for IBM FileNet P8
Automatically organizing your content by understanding it

Contact Reggie Twigg ([email protected]) for more information or to arrange a demonstration