A Case-based Approach to Content Analysis in Cross-Domain ... · Assisted RHR: Automation Steps...

23
+ A Case-based Approach to Content Analysis in Cross-Domain Information Sharing 1 Research support by Airforce Research Lab (AFRL), Rome, New York, under SBIR Phase I (Contract number FA8750-08-C-0110) and SBIR Phase II (Contract number FA8750-07-C-0117) Thomas Reichherzer, Badri Lokanathan, Ashwin Ram Enkia Corp., Atlanta GA KSCO 2012, Pensacola, FL

Transcript of A Case-based Approach to Content Analysis in Cross-Domain ... · Assisted RHR: Automation Steps...

Page 1: A Case-based Approach to Content Analysis in Cross-Domain ... · Assisted RHR: Automation Steps Select documents that have been marked up by RHR. mark-up indicates sensitive content

+ A Case-based Approach to Content Analysis in Cross-Domain Information Sharing

1

Research support by Airforce Research Lab (AFRL), Rome, New York, under

SBIR Phase I (Contract number FA8750-08-C-0110) and

SBIR Phase II (Contract number FA8750-07-C-0117)

Thomas Reichherzer, Badri Lokanathan, Ashwin Ram

Enkia Corp., Atlanta GA

KSCO 2012, Pensacola, FL

Page 2: A Case-based Approach to Content Analysis in Cross-Domain ... · Assisted RHR: Automation Steps Select documents that have been marked up by RHR. mark-up indicates sensitive content

+ Presentation Outline

Cross-Domain Information Sharing & Challenges.

Automated Approaches for Reliable Human Review.

Experimental Evaluation.

Discussions and Conclusion.

2

KSCO 2012, Pensacola, FL

Page 3: A Case-based Approach to Content Analysis in Cross-Domain ... · Assisted RHR: Automation Steps Select documents that have been marked up by RHR. mark-up indicates sensitive content

+ Cross-Domain Information Sharing

Government and industry alike benefit from information sharing.

industry:

develop and expand new partnerships among business partners,

public relations

government:

exchange of mission-critical information across different agencies

freedom-of-information act

Sharing takes place across institutional boundaries or security domains.

information cannot be freely shared

3

KSCO 2012, Pensacola, FL

Page 4: A Case-based Approach to Content Analysis in Cross-Domain ... · Assisted RHR: Automation Steps Select documents that have been marked up by RHR. mark-up indicates sensitive content

+ Reliable Human Review

Information must be reviewed to remove sensitive content.

released information must be in compliance with non-disclosure policies across security domains

policies guide release of information

Information review completed by review officers (e.g. FDO).

reviewer identifies sensitive content in document to be removed priority release

review process is time intensive and requires significant human expertise

policies are complex and subject to changes

KSCO 2012, Pensacola, FL

4

Reliable Human Review (RHR) presents a significant bottleneck to “just-in-time” information needs.

Page 5: A Case-based Approach to Content Analysis in Cross-Domain ... · Assisted RHR: Automation Steps Select documents that have been marked up by RHR. mark-up indicates sensitive content

+ Just-Enough Information Sharing

Problem: Identifying shareable information in documents is time consuming and

laborious.

Security policies are high-level and difficult to capture by rules.

Timely dissemination of appropriate information is critical in crisis situations.

5

Need:

Tools for assistance with the classification of information across multiple security domains.

Tools to develop and apply security policies.

Proposed Approach:

Assist RHR by automatic text classification of unstructured text.

A combined approach of Natural Language Processing (NLP) and Case Based-Reasoning (CBR) to automate process of selecting sensitive content.

KSCO 2012, Pensacola, FL

Page 6: A Case-based Approach to Content Analysis in Cross-Domain ... · Assisted RHR: Automation Steps Select documents that have been marked up by RHR. mark-up indicates sensitive content

+ Assisted RHR: Automation Steps

Select documents that have been marked up by RHR.

mark-up indicates sensitive content with respect to release domain

mark-up captures non-disclosure policies

KSCO 2012, Pensacola, FL

6

Training

P

Recommending

case base

Feedback

Feed marked up documents into a text classifier.

“learn” security policies from mark-up information

mark-up is labeled into categories

train different classifiers for different release domains

Apply classifier to unmarked documents.

use feedback from RHR to adjust classifier

Page 7: A Case-based Approach to Content Analysis in Cross-Domain ... · Assisted RHR: Automation Steps Select documents that have been marked up by RHR. mark-up indicates sensitive content

+ Sanitizing Unstructured Text Workflow

7

Analyst opens/authors

document in MS Word or Web Editor.

Analyst selects policy to be applied.

Analyst reviews automatically

generated markup.

Analyst edits text to enforce selected

policy.

P

KSCO 2012, Pensacola, FL

Page 8: A Case-based Approach to Content Analysis in Cross-Domain ... · Assisted RHR: Automation Steps Select documents that have been marked up by RHR. mark-up indicates sensitive content

+ Policy Creation Workflow

8

Act of Terrorism

Defense

Secret

Defense Acts of

Terrorism

Analyst reviews existing markup.

Analyst identifies markup

categories.

Analyst creates release policies.

KSCO 2012, Pensacola, FL

Page 9: A Case-based Approach to Content Analysis in Cross-Domain ... · Assisted RHR: Automation Steps Select documents that have been marked up by RHR. mark-up indicates sensitive content

+

9

Problem-Solving Steps

Learning User Feedback

Decision-Making Recommendation

Text Analysis

Content Query

Unmarked Documents

Mark-up recommendations are generated at sentence level!

Page 10: A Case-based Approach to Content Analysis in Cross-Domain ... · Assisted RHR: Automation Steps Select documents that have been marked up by RHR. mark-up indicates sensitive content

+

10

How Recommendations are generated

1. User gives classifier some examples.

Sensitive

Non-sensitive.

2. User presents a new problem to classifier.

?

3. Classifier retrieves similar cases.

4. Classifier decides on classification.

5. Classifier gives a recommendation.

Page 11: A Case-based Approach to Content Analysis in Cross-Domain ... · Assisted RHR: Automation Steps Select documents that have been marked up by RHR. mark-up indicates sensitive content

+

11

Architecture Overview

Marked-up

Documents

Text Classification System

Document Analyzer

Case Builder

CBR Engine

Classifier / Marker

Learner

Unmarked

Documents

User

Feedback

Case Libraries

& Indices

Page 12: A Case-based Approach to Content Analysis in Cross-Domain ... · Assisted RHR: Automation Steps Select documents that have been marked up by RHR. mark-up indicates sensitive content

+

12

Building a Case Base

Marked-up

Documents

Case Library

Training Workflow

Text Analyzer

Case Builder

CBR Engine

Sentence Segmenter

Pronoun Resolver

Sentence Parser

Domain A

Domain A

Page 13: A Case-based Approach to Content Analysis in Cross-Domain ... · Assisted RHR: Automation Steps Select documents that have been marked up by RHR. mark-up indicates sensitive content

+

13

Generating Recommendations

User

Feedback

New

Documents

Case

Library

Classification Workflow

Document Analyzer

Case Retriever CBR Engine

Marker / Classifier

Marked-up

Documents

Learner

Domain A

Page 14: A Case-based Approach to Content Analysis in Cross-Domain ... · Assisted RHR: Automation Steps Select documents that have been marked up by RHR. mark-up indicates sensitive content

+

14

Input Sentence - Example

BEARCLAW aircraft operating inside Friendlandia along the Narcotica border have intercepted communications indicating the site of a large heroin processing facility at PK848972 approx 4km north of the village of Lago Springo.

information not to be disclosed

Page 15: A Case-based Approach to Content Analysis in Cross-Domain ... · Assisted RHR: Automation Steps Select documents that have been marked up by RHR. mark-up indicates sensitive content

+

15

Case Construction - Parsing

[[NP BEARCLAW aircraft] [VP operating inside Friendlandia along the Narcotica border]] [VP [VP have intercepted] [NP communications] [S [VP indicating] [NP the site of a large heroin processing facility] [PP at PK848972 approx 4km north of the village of Lago Springo.]

Page 16: A Case-based Approach to Content Analysis in Cross-Domain ... · Assisted RHR: Automation Steps Select documents that have been marked up by RHR. mark-up indicates sensitive content

+

16

Case Construction - Mapping

[[PROBLEM [SUBJECT BEARCLAW aircraft] [SUBJECT_SUB1 operating inside Friendlandia along the Narcotica border] [PREDICATE have intercepted] [OBJECT communications] [OBJECT_SUB1 indicating the site of a large heroin processing

facility at PK848972 approx 4km north of the village of Lago Springo]]

[SOLUTION [MARKUP operating Friendlandia Narcotica border] [CLASSIFICATION NDP-1 Category 1?]]

Page 17: A Case-based Approach to Content Analysis in Cross-Domain ... · Assisted RHR: Automation Steps Select documents that have been marked up by RHR. mark-up indicates sensitive content

+

17

Feature Vector of Case

Features: SUBJECT, SUBJECT_SUB1,

PREDICATE, OBJECT, OBJECT_SUB1, …

[[PROBLEM

[SUBJECT w1 w2 … wn]

[SUBJECT_SUB1 w1 w2 … wn]

[PREDICATE w1 w2 … wn] [OBJECT]

[OBJECT_SUB1 w1 w2 … wn]] [SOLUTION [MARKUP w1 w2 … wn] [CLASSIFICATION c]]

Page 18: A Case-based Approach to Content Analysis in Cross-Domain ... · Assisted RHR: Automation Steps Select documents that have been marked up by RHR. mark-up indicates sensitive content

+

18

Distance between Sentences

S1: aircraft transport troops & equipment.

S2: aircraft fly over enemy territory.

S3: troops are deployed in foreign country. S1

S2

S3

[[SUBJECT aircraft]

[PREDICATE fly over]

[OBJECT enemy territory]]

[[SUBJECT troops]

[PREDICATE are deployed in]

[OBJECT foreign country]]

S2 – S3 sentence comparison:

[[SUBJECT aircraft]

[PREDICATE transport]

[OBJECT troops & equipment]]

[[SUBJECT aircraft]

[PREDICATE fly over]

[OBJECT enemy territory]]

S1 – S2 sentence comparison:

[[SUBJECT aircraft]

[PREDICATE transport]

[OBJECT troops & equipment]]

[[SUBJECT troops]

[PREDICATE are deployed in]

[OBJECT foreign country]]

S1 – S3 sentence comparison:

Page 19: A Case-based Approach to Content Analysis in Cross-Domain ... · Assisted RHR: Automation Steps Select documents that have been marked up by RHR. mark-up indicates sensitive content

+

19

Distance Metric (Sentences)

Let S1 = <f11,f12,f13,f14,f15>

S2 = <f21,f22,f23,f24,f25> features

D(S1,S2)= k1sim(f11,f21) + k2sim(f12,f22)

+ k3sim(f13,f23) + k4sim(f14,f24) +

k5sim(f15,f25)

k1, k2, k3, k4, k5 Parameter:

Page 20: A Case-based Approach to Content Analysis in Cross-Domain ... · Assisted RHR: Automation Steps Select documents that have been marked up by RHR. mark-up indicates sensitive content

+

20

Distance Metric (Words)

If f11 = [w111,w112,…,w11n] and

f21 = [w211,w212,…,w21n]

where

sim(f11,f21) = sim(w111,w211)+

sim(w111,w212)+ … + sim(w11n,w21n1)

Lin Thesaurus

Page 21: A Case-based Approach to Content Analysis in Cross-Domain ... · Assisted RHR: Automation Steps Select documents that have been marked up by RHR. mark-up indicates sensitive content

+ Performance Evaluation – Data Set

IMdb database.

movie descriptions on selected movies rated PG, PG-13, and R

description was manually classified using 6 different categories:

21

KSCO 2012, Pensacola, FL

Category Number of cases

general violence 65

graphic 730

nudity 498

drug use 349

dark topic 298

sexual content 6

Page 22: A Case-based Approach to Content Analysis in Cross-Domain ... · Assisted RHR: Automation Steps Select documents that have been marked up by RHR. mark-up indicates sensitive content

+ Performance Evaluation – Training and Testing

Train classifier using marked-up information from movie descriptions.

Condition A: use all mark-up data or 1946 cases.

Condition B: use 90% of the mark-up data (randomly selected) or 1752 cases.

Error rate:

Under condition B, some cases generalize.

KSCO 2012, Pensacola, FL

22

Condition A Condition B

miss-labeled 30.5% 33.1%

not-labeled 1.6% 1.6%

Page 23: A Case-based Approach to Content Analysis in Cross-Domain ... · Assisted RHR: Automation Steps Select documents that have been marked up by RHR. mark-up indicates sensitive content

+ Conclusions

Information sharing is critical; automated methods are needed. methods need to go beyond keyword checking

Proposed approach captures human expertise in classifying information. policies are indirectly captured as cases in case base

Markup and classification generated at sentence level, not document level.

Direct feedback from reviewer refines and revises system’s classification knowledge.

Scalability affected by larger training sets. more examples improve accuracy

more examples slow down classification

KSCO 2012, Pensacola, FL

23