© 2008 IBM Corporation

eMail and Records Management with IBM Classification Module

Jon Dellaria, IBM Certified ECM Information Technology Specialist

Information Management software | Enterprise Content Management

What is Classification?

Definition:

Clas·si·fi·ca·tion [klas-uh-fi-key-shuhn] – n. – the act of assigning an element (a document, for example) to a category.

IBM – Leadership in Text Analysis and Classification

IBM has a 50+ year history in text analysis and discovery

– As early as 1957, IBM published pioneering research on text classification (and related topics, such as text search and automatic creation of text abstracts)

IBM invests ~$50M annually in research and development for search and text analytics

– 200 people actively engaged in R&D

– IBM holds over 200 patents in information access with more each year

Options for Implementing the Classification Process

[Chart: classification methods positioned on Low-to-High axes for cost savings, productivity, and accuracy. Methods shown: Manual Classification, Authoring Templates, Rules-Based Classification, Context-Based Classification, Multiple Methods. Annotations: Simple Rules, Complex Policies, Consistent Participation & Enforcement.]

IBM Classification Module

Implementing the classification process in ECM & more

Intelligent applications of policies via automatic, advanced classification

Combines the best automatic methods: context sensitive and rule-based

Flexible automation levels accelerate adoption and acceptance

Incorporates user feedback in real-time to improve understanding

Integrates with the IBM ECM architecture or runs as a free-standing service

12 languages – and 3 more on the way!

[Chart: Manual Classification, Rules-Based Classification, and Context-Based Classification on Low-to-High axes, with ICM positioned at Multiple Methods.]

Advanced Classification is Key to Compliant Information Management

Advanced Classification: The Facts

Facts and Implications

1. Fact: Unstructured content makes up 80% of the volume of information in the average enterprise, and that segment is growing 30% annually
   Implication: Deploying an archiving or records management initiative is an increasingly important, large-scale, and difficult problem

2. Fact: Business users find forced manual classification “burdensome,” and at least 50% will not participate
   Implication: Results of human-reliant filing are inconsistent and inaccurate, resulting in effective accuracy of 50%, at best

3. Fact: Humans provide, at best, marginally better accuracy in executing classification, in controlled tests
   Implication: Compliance professionals hold the incorrect assumption that humans are the best option for piece-by-piece decision-making

4. Fact: Every manual classification forced on your users will cost your organization 17 cents in productivity
   Implication: Widespread adoption of archiving or records management in your organization will lead to large, measurable productivity loss

Critical Dimensions of Classification

                     Manual     Automated
Cost (per doc)       $0.17      < $0.01
Accuracy             92%        50 – 80%
Consistency          < 50%      100%

Manual accuracy × consistency: 46%

[Arrow label: Increasing Volume]
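The 46% figure on this slide is what you get when 92% accuracy on filed items is multiplied by roughly 50% participation. A minimal sketch of that arithmetic, plus per-document cost at scale, follows. The function names and the one-million-document volume are illustrative assumptions, not figures from the presentation, and the automated cost uses $0.01 as a ceiling because the slide states only "< $0.01".

```python
def effective_accuracy(accuracy: float, consistency: float) -> float:
    """Accuracy on filed items, scaled by the share of items actually filed."""
    return accuracy * consistency


def classification_cost(documents: int, cost_per_doc: float) -> float:
    """Total cost of classifying a given number of documents."""
    return documents * cost_per_doc


# Figures from the slide; 0.50 is the upper bound of "< 50%" consistency.
manual_effective = effective_accuracy(0.92, 0.50)
# Automated range of 50-80% accuracy at 100% consistency.
automated_low = effective_accuracy(0.50, 1.00)
automated_high = effective_accuracy(0.80, 1.00)

# Hypothetical volume, chosen only for illustration.
volume = 1_000_000
manual_cost = classification_cost(volume, 0.17)
automated_cost_ceiling = classification_cost(volume, 0.01)

print(f"Manual effective accuracy: {manual_effective:.0%}")
print(f"Automated effective accuracy: {automated_low:.0%} to {automated_high:.0%}")
print(f"Manual cost for {volume:,} docs: ${manual_cost:,.0f}")
print(f"Automated cost ceiling: ${automated_cost_ceiling:,.0f}")
```

Under these assumptions the manual route is both more expensive per document and less accurate in effect, because non-participation discounts the raw accuracy.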

Participation Impacts Accuracy

National Archives and Records Administration Study

– Electronic Records Management initiative focused on user driven records declaration

– 6+ month study

– 60% drop-off in participation in months after training

End users frequently outright refuse to categorize content

[Chart: Participation in Manual Filing, by Month (months 1 – 7)]

Manual classification and an emphasis on “user training” are outdated and provide inconsistent and inaccurate results

Inconsistent participation from humans is the critical factor in evaluating different classification methods

Manual Classification

With paper

With rudimentary electronics

Today’s advanced electronics

Rules-based Classification

Simple Rules:
– Does the body contain the phrase “sure thing”?
– Did the CFO send the email?

Metadata extraction:
– Does the body of the email have anything that matches the pattern “XXX-YY-ZZZZ”?

Complex Policies:
– Does the body contain the phrase “sure thing” in the same sentence as “stock”?
– Did the sender belong to the “broker” email group and send an email externally using the phrase “sure thing” in the body?

To: Bob Smith <[email protected]>
From: Bill Roker <[email protected]>
Subject: Market Movement

Bob,

Hope you’re doing well. I’ve got a sure thing going with the stock we spoke about on the phone. I think it’s time to pull the trigger for my client.

The client’s name is John Doe. His social is 123-45-6789. He’s totally on board and he’s excited to take advantage of this new offer.

Talk to you tomorrow,
Bill

Bill Roker
212-555-1234
Financial Advisors, Inc.
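The rule types on this slide can be sketched with regular expressions and simple predicates. This is a minimal illustration, not how IBM Classification Module evaluates rules. The function names, the `BROKER_GROUP` set, the `INTERNAL_DOMAIN` value, and the addresses are all hypothetical (the addresses in the sample email are redacted), and the “XXX-YY-ZZZZ” pattern is read here as three digits, two digits, four digits.

```python
import re

# Body of the sample email from the slide, including the signature phone number.
BODY = """Bob,

Hope you're doing well. I've got a sure thing going with the stock we spoke
about on the phone. I think it's time to pull the trigger for my client.

The client's name is John Doe. His social is 123-45-6789. He's totally on
board and he's excited to take advantage of this new offer.

Talk to you tomorrow,
Bill

Bill Roker
212-555-1234
"""

# Assumed reading of "XXX-YY-ZZZZ": 3 digits, 2 digits, 4 digits.
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Hypothetical group membership and internal mail domain.
BROKER_GROUP = {"bill.roker@advisors.example"}
INTERNAL_DOMAIN = "advisors.example"


def contains_phrase(body: str, phrase: str) -> bool:
    """Simple rule: case-insensitive phrase match."""
    return phrase.lower() in body.lower()


def extract_ssn_like(body: str) -> list:
    """Metadata extraction: every substring matching the SSN-like pattern."""
    return SSN_LIKE.findall(body)


def same_sentence(body: str, first: str, second: str) -> bool:
    """Complex policy: both phrases occur within one sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", body.lower())
    return any(first in s and second in s for s in sentences)


def broker_external_sure_thing(sender: str, recipient: str, body: str) -> bool:
    """Complex policy: a broker sends 'sure thing' to an external address."""
    is_external = not recipient.lower().endswith("@" + INTERNAL_DOMAIN)
    return (sender.lower() in BROKER_GROUP
            and is_external
            and contains_phrase(body, "sure thing"))


print(contains_phrase(BODY, "sure thing"))
print(extract_ssn_like(BODY))
print(same_sentence(BODY, "sure thing", "stock"))
print(broker_external_sure_thing(
    "bill.roker@advisors.example", "bob.smith@client.example", BODY))
```

Note that the phone number in the signature does not match the SSN-like pattern, because its middle group has three digits rather than two.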

Rule-based Classification’s Achilles’ Heel: Rule Maintenance, Accuracy and Cost

[Chart: Accuracy over Time. Annotations: Changes in business; Effort to adjust rules to new environment.]

Context Sensitive Classification

Statistics-Based Categorization

[Diagram: Unclassified text is sorted into Category 1, Category 2, or Category 3.]
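One common way to do statistics-based categorization is a multinomial naive Bayes classifier trained on example documents per category. The sketch below is a textbook technique swapped in for illustration only. The presentation does not say which statistical method IBM Classification Module uses, and the class name, training sentences, and category labels are made up.

```python
import math
from collections import Counter, defaultdict


def tokenize(text: str) -> list:
    """Lowercase whitespace tokenizer; real systems do far more."""
    return text.lower().split()


class NaiveBayesCategorizer:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> training docs
        self.vocabulary = set()

    def train(self, category: str, text: str) -> None:
        tokens = tokenize(text)
        self.word_counts[category].update(tokens)
        self.doc_counts[category] += 1
        self.vocabulary.update(tokens)

    def scores(self, text: str) -> dict:
        """Log-probability score of the text under each category."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        tokens = tokenize(text)
        result = {}
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            log_prob = math.log(n_docs / total_docs)  # category prior
            for token in tokens:
                log_prob += math.log(
                    (counts[token] + 1) / (total_words + vocab_size))
            result[category] = log_prob
        return result

    def classify(self, text: str) -> str:
        scored = self.scores(text)
        return max(scored, key=scored.get)


# Made-up training examples for two categories.
categorizer = NaiveBayesCategorizer()
categorizer.train("Finance", "quarterly budget forecast and invoice totals")
categorizer.train("Finance", "invoice payment due for budget review")
categorizer.train("Legal", "contract dispute and litigation hold notice")
categorizer.train("Legal", "review the contract terms with counsel")

print(categorizer.classify("invoice for the quarterly budget"))
print(categorizer.classify("litigation hold on the contract"))
```

Unlike a keyword rule, the score depends on every word in the text, so new training examples shift the decision without anyone rewriting rules.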

Context Sensitive Classification

Simple rules or keyword-based analysis can be too coarse to make fine distinctions between long-form texts with very different intent

Choosing the Right Classification Method

Combined approaches provide the maximum accuracy from automation, at a slight productivity cost

Automated methods slash the costs

Manual methods have high costs associated with them

Manual methods suffer from lack of participation, hampering their overall viability

[Chart: classification methods positioned on Low-to-High axes for cost savings, productivity, and accuracy. Methods shown: Manual Classification, Authoring Templates, Rules-Based Classification, Context-Based Classification, Multiple Methods. Annotations: Simple Rules, Complex Policies, Consistent Participation & Enforcement.]

Enterprise Compliance Vision

Integrated Agile ECM Platform for Compliant Information Management

[Diagram: IBM ECM platform with four numbered components: Records Management, Electronic Discovery, Advanced Classification, Content Collection.]

Reclassification & Records Management

File plan: Legal

File plan: Marketing

File plan: Finance

File plan: Research & Development

. . .

Review & Audit

Records Management

IBM Classification Module

ECM Repository

US Army Email and Records Manager Pilot

GOAL

Provide a means to address the Army’s requirement for the successful records management of email

– Challenges faced:

• Lack of records management follow-through from end users
• Need to capture records and transactional activities from email
• Need to capture records without user intervention

US Army Email and Records Manager Pilot

Success Criteria for pilot:

– Correctly capture and retrieve email provided

– Ensure information is secure

– Determine whether email can be accurately auto-categorized by the IBM Classification Module (ICM)

• Goal of 90% or better accuracy
• Show how ICM learns and improves accuracy over time

– Place categorized record emails under correct Army records disposition

Army Email Pilot Concept of Operations (CONOPS)

Concept of Operations

Tasks Phase I Phase II Phase III

Identification of Records Categories

Delivery of .pst files

Organization of .pst files to build knowledge base

Ingesting of Emails – Build Corpus

Ingesting of Emails - Auto Cat Runs

Auditing

complete

complete

complete

Pilot Phases

Pre-Phase Activity

– Teach the system by building the knowledge base (Corpus)

Phase I

– Process the first run of sample .pst files

– Review and Audit the results

Phase II (30 days later)

– Process the second run of sample .pst files

– Review and Audit the results

Phase III (30 days later)

– Process the third run of sample .pst files

– Review and Audit the results

Knowledge Base (Corpus) Training

Record Category: Marketing

Record Category: Legal

Record Category: Finance

Record Category: R&D

. . .

Army Records Managers

User 1 Email

PST Inboxes

Organized Email

User 2 Email

User n Email

. . .

Outlook Configuration

Building the Knowledge Base for Email Categorization

Reports

Training Knowledge Base - The Results

[Charts: Raw Data; Adjusted Data]

Pilot Project Pre-Phase Activities

Build Categorization Knowledge Base

• Work with Army Records Managers to define the most appropriate records categories and identify example emails for them

Goal:

– Find examples of email records for each of the record categories

– Find 15 – 20 examples for each category

Results:

– 54 records categories were identified as being associated with the assigned offices

• 28 categories have 15 or more examples

• 26 categories have 14 or fewer examples
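A coverage check like the one reported here (which categories reach the 15-example target) is easy to automate before training. A minimal sketch follows; the function name and the example counts are hypothetical, and only the threshold of 15 comes from the slide.

```python
MIN_EXAMPLES = 15  # lower bound of the 15-20 target stated on the slide


def split_by_coverage(example_counts: dict, minimum: int = MIN_EXAMPLES):
    """Separate categories with enough examples from those still short.

    Returns (ready, shortfall): a sorted list of category names that meet
    the minimum, and a dict mapping each remaining category to the number
    of additional examples it needs.
    """
    ready = sorted(
        name for name, count in example_counts.items() if count >= minimum)
    shortfall = {
        name: minimum - count
        for name, count in example_counts.items()
        if count < minimum
    }
    return ready, shortfall


# Hypothetical counts of example emails gathered per record category.
counts = {"Marketing": 18, "Legal": 15, "Finance": 9, "R&D": 14}
ready, shortfall = split_by_coverage(counts)

print("Ready to train:", ready)
print("Examples still needed:", shortfall)
```

Running a check like this after each round of collecting examples shows records managers where more sample emails are needed.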

Army Email Pilot Phase I – III Auto Categorization Steps

...

Review & Audit

IBM P8 eMail Manager

Records Management

Search Engine

eMail Archive

Record Category: 690 (Personnel)

Record Category: 37 (Budget and Resource Management)

Record Category: 25-30y (Publication Reports)

Record Category: 1hh (Temporary Duty)

. . .

Spam and Non-Records (Retention: 90 days)

IBM Classification Module

.PST Files

P8 ‘InBox’ Folder

1 Army Records Manager

Pilot Project Phase I – III Activities

First Pass of Categorization (process .pst files)

– Take the knowledge base created by Army Records Managers and apply it to the bulk of email

– Measure the categorization results returned and begin the Audit and Review process

Audit and Review process

– Audit: confirms the accuracy of categorization via a random sampling of categorized results. If necessary, the chosen category may be modified, which serves to retrain the knowledge base for the future

– Review: items that do not meet the defined thresholds for categorization are available for further analysis and categorization by records personnel

– The result of Audit and Review is improved accuracy of the knowledge base and therefore improved categorization for future email ingest

Post Audit/Review reprocessing of email to measure categorization improvements

Measure results for the completion of each Phase
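The Audit and Review steps on this slide can be sketched as threshold routing plus a random audit sample whose corrections become new training examples. This is an illustration under stated assumptions, not the pilot's implementation: the `Result` type, the 0.80 threshold, the function names, and the sample data are all hypothetical.

```python
import random
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.80  # hypothetical confidence cut-off


@dataclass
class Result:
    email_id: str
    category: str
    score: float  # classifier confidence in the chosen category


def route(results, threshold=REVIEW_THRESHOLD):
    """Review: items below the threshold go to records personnel."""
    accepted = [r for r in results if r.score >= threshold]
    needs_review = [r for r in results if r.score < threshold]
    return accepted, needs_review


def audit_sample(accepted, sample_size, seed=0):
    """Audit: draw a random sample of auto-categorized items."""
    rng = random.Random(seed)
    return rng.sample(accepted, min(sample_size, len(accepted)))


def audit_accuracy(sample, confirmed):
    """Compare sampled categories with the records manager's decisions.

    `confirmed` maps email_id to the category the auditor chose. Returns
    the sample accuracy and the corrections, which could be fed back to
    the knowledge base as new training examples.
    """
    corrections = [
        (r.email_id, confirmed[r.email_id])
        for r in sample
        if confirmed[r.email_id] != r.category
    ]
    accuracy = 1 - len(corrections) / len(sample) if sample else 0.0
    return accuracy, corrections


# Made-up categorization output for four emails.
results = [
    Result("m1", "Personnel", 0.95),
    Result("m2", "Budget", 0.91),
    Result("m3", "Personnel", 0.85),
    Result("m4", "Budget", 0.40),
]
accepted, needs_review = route(results)
sample = audit_sample(accepted, sample_size=3)
confirmed = {"m1": "Personnel", "m2": "Budget", "m3": "Temporary Duty"}
accuracy, corrections = audit_accuracy(sample, confirmed)

print(f"Sent to review: {[r.email_id for r in needs_review]}")
print(f"Audit accuracy: {accuracy:.1%}")
print(f"Corrections for retraining: {corrections}")
```

Reprocessing the same email after the corrections are applied is what would produce a post-audit measurement comparable to the first pass.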

Pilot Project Activities

Focus on email from 16 different offices across Army

• Demonstrate ability to categorize emails across Army enterprise

PST files from 398 pre-selected users

• 581,634 emails in total in Phase I

• 581,256 emails in total in Phase II

• 735,333 emails in total in Phase III

• 1,898,232 total emails through Phase III

PST files transferred to the pilot system via secure connection

Phase I Categorization Results

                        First Pass   Post Audit/Review
Total Categorized       84.5%        98.8%
Total Not Categorized   15.5%        1.2%

Phase II Categorization Results

                        First Pass   Post Audit/Review
Total Categorized       99.01%       99.9%
Total Not Categorized   .9%          .1%

Phase III Categorization Results

                        First Pass   Post Audit/Review
Total Categorized       98.4%        99.9%
Total Not Categorized   1.6%         .1%

Army Records Manager Observations

As a records manager with a 25-year background in federal and civilian records management, I believe the automatic categorization of information is the next logical evolution in managing the records of an organization.

The classifier correctly identifies categories of records based on information from office file plans. Since office file plans are incorporated within an agency records manual, the initial input for the system is nominal. The office file plan becomes the document classifier.

Because the classifier retains information on document retrieval activity, it may be appropriate for use in many other information management program areas, including the Freedom of Information and Privacy Act.

Demo

Thank You

IBM Records Manager with Army File Plan