Noisy Text Analytics: An Exercise in Futility? Rohini Srihari Janya, Inc. 8 January 2007.


Page 1

Noisy Text Analytics: An Exercise in Futility?

Rohini Srihari

Janya, Inc.

www.janyainc.com

8 January 2007

Page 2

Overview: Noisy Text Analytics

• All text is noisy!

– Does not fit shrink-wrapped processing; adaptation is necessary

• Business and national security interests in processing:

– Open source data (e.g. web pages)

– Consumer-generated media (blogs, newsgroups, chat, text messaging, etc.)

• Key is to identify analysis requirements clearly

– Not necessary to understand everything

Page 3

Challenging Problems

• Mixed modalities

– Structured and unstructured; free text cannot be processed in a vacuum; need to correlate information from different sections

– Text with images, figures

• Improve within-document and cross-document information consolidation

• World models for discourse processing

– Need to bring in more context; relate text analytics to semantic web activities (DAML/OWL)

– Dynamic use of online resources

• Adaptive text analytics

– Extraction requirements are constantly changing, and so is the data!

– Corpus-based learning

• Flexible architectures

– Integrating additional preprocessing, handling streaming data, etc.

Page 4

USMTF Document Structure

OPER/BRAVE CHILD//
MSGID/BDAREP PHASE2/NMJIC/F-0005//
BDAREPID/BEN:1111-22222/REPCOUNT:1//
ICOD/011630ZJAN2002//
BDACELL/NMJIC/TEL:COM 777-666-9999/TEL:DSN 222-9999/SECTEL:999-3333//
GENTEXT/PURPOSE/THIS PHASE 2 BDA REPORT IS AN ALL-SOURCE ASSESSMENT CONTAINING DETAILED PHYSICAL AND FUNCTIONAL DAMAGE ASSESSMENTS, INPUTS TO THE TARGET SYSTEM ASESSMENT, AND COMMENTS ON MUNITION EFFECTIVENESS. PHASE 2 IMAGERY, IF PRODUCED, CAN BE LOCATED ON THE IMAGERY SERVER USING THE KEYWORD 'BDA.'//

Page 5

Sample Document

Sets: in the message on the previous slide, every unit that ends with "//" is a set, for example MSGID/BDAREP PHASE2/NMJIC/F-0005//

Page 6

Sample Document

Fields: in the same message, the fields within a set are separated by "/", for example BDAREP PHASE2, NMJIC and F-0005 in the MSGID set

Page 7

Sample Document

Free-text field: in the same message, the last field of the GENTEXT/PURPOSE set is free-form narrative text ("THIS PHASE 2 BDA REPORT IS AN ALL-SOURCE ASSESSMENT ...") rather than a coded value
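The set and field structure of the sample message can be sketched in a few lines of Python. This is only an illustration, not Janya's parser or a full USMTF implementation. The function name `parse_usmtf` and the output layout are hypothetical, and it assumes that sets end with "//", that fields are separated by "/", and that only the free text of a GENTEXT set may itself contain "/".

```python
def parse_usmtf(message):
    """Split a USMTF-style message into sets, and each set into fields.

    Assumption for this sketch: sets end with '//' and fields are
    separated by '/'. A GENTEXT set is split only once after its
    identifier, so its free text is kept as a single field.
    """
    sets = []
    for raw_set in message.split("//"):
        raw_set = " ".join(raw_set.split())  # collapse line breaks and extra spaces
        if not raw_set:
            continue
        set_id, _, rest = raw_set.partition("/")
        if set_id == "GENTEXT":
            fields = rest.split("/", 1)      # [text indicator, free text]
        else:
            fields = rest.split("/")
        sets.append({"id": set_id, "fields": fields})
    return sets


sample = (
    "OPER/BRAVE CHILD//MSGID/BDAREP PHASE2/NMJIC/F-0005//"
    "ICOD/011630ZJAN2002//"
    "GENTEXT/PURPOSE/THIS PHASE 2 BDA REPORT IS AN ALL-SOURCE ASSESSMENT.//"
)
parsed = parse_usmtf(sample)
for s in parsed:
    print(s["id"], s["fields"])
```

Keeping the GENTEXT narrative as one field means the structured sets can be handled by lookup while the narrative is passed on to free-text analysis.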

Page 8

Sample Document

TGTELEM/PLANNED:Y/-/TGTEL:C2 OPERATIONS BLDG/TMPAGE:G3/TMGRID:B.5-S.0//
ELEMDMG/PHYDMG:SVR/CONF:CONF/FUNCDMG:DES/STCHG:Y/MINRECUP:3MON/MAXRECUP:6MON//
GENTEXT/DAMAGE NARRATIVE/ALL-SOURCE INTELLIGENCE CONFIRMS THAT THE C2 OPERATIONS BUILDING HAS SUFFERED SEVERE INTERNAL DAMAGE AND IS FUNCTIONALLY DESTROYED. EXTENSIVE SMOKE FROM INTERNAL FIRES IS CLEARLY VISABLE. NUMEROUS FIRE TRUCKS ARE IN THE FACILITY. COCKPIT VIDEO CONFIRMS FOUR WEAPONS IMPACTING, WITH AT LEAST ONE PENETRATING TO THE BASEMENT OF THE BUILDING. ESTIMATE BIG COUNTRY WILL REQUIRE SIGNIFICANT TIME, AND PROBABLE FOREIGN TECHNICAL ASSISTANCE TO RECONSTITUTE C2 EQUIPMENT//

Entity Description/Name Field: TGTEL:C2 OPERATIONS BLDG

Page 9

Sample Document

Reference to Structured Sets from Free Text: in the message on the previous slide, the narrative's "THE C2 OPERATIONS BUILDING" appears to refer back to the TGTEL:C2 OPERATIONS BLDG field of the TGTELEM set, and "SEVERE INTERNAL DAMAGE" and "FUNCTIONALLY DESTROYED" correspond to PHYDMG:SVR and FUNCDMG:DES in the ELEMDMG set
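One simple way to connect free text back to the structured sets is to expand abbreviations in the structured values and look for them in the narrative. The sketch below is a minimal illustration under that assumption. The abbreviation table, the function names and the exact-substring matching are all hypothetical simplifications and do not describe how Semantex resolves such references.

```python
# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"BLDG": "BUILDING"}


def expand(value):
    """Replace known abbreviations token by token."""
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in value.split())


def find_references(structured, narrative):
    """Return the names of structured fields whose expanded value occurs in the narrative."""
    text = " ".join(narrative.split())  # normalise whitespace
    return [name for name, value in structured.items() if expand(value) in text]


structured = {"TGTEL": "C2 OPERATIONS BLDG", "TMPAGE": "G3"}
narrative = (
    "ALL-SOURCE INTELLIGENCE CONFIRMS THAT THE C2 OPERATIONS BUILDING "
    "HAS SUFFERED SEVERE INTERNAL DAMAGE"
)
refs = find_references(structured, narrative)
print(refs)
```

Exact matching is brittle on noisy input (for example when words are fused or misspelled), so a real system would need fuzzier matching than this.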

Page 10

Cross-Document Entity Profile

Page 11

Corpus-Based Learning

• Training phase requires four inputs

– Document repository (unlabeled training data)

– Config file 1 for DTL Context (how to create unlabeled training data)

– Seed file (how to label a small amount of unlabeled training data)

– Config file 2 for Learning Tool

• How to learn a model

• How to use the learned model in Semantex

[Figure: training pipeline diagram with the components Document Repository, Config File 1, DTL Context, Training Data, Seed File, Config File 2, Learning Tool, Trainer and Learned Model]
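The four training inputs can be illustrated with a toy pipeline. This is a sketch under stated assumptions, not the DTL Context or the Learning Tool themselves. The function names, the context-window parameter standing in for config file 1, the seed dictionary standing in for the seed file, and the count-based learner standing in for the choice made in config file 2 are all hypothetical.

```python
from collections import Counter, defaultdict


def build_contexts(documents, window):
    """Stand-in for the context step: one (word, context words) instance per token."""
    instances = []
    for doc in documents:
        tokens = doc.lower().split()
        for i, word in enumerate(tokens):
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            instances.append((word, context))
    return instances


def label_with_seeds(instances, seeds):
    """Label only the instances whose word appears in the seed list."""
    return [(context, seeds[word]) for word, context in instances if word in seeds]


def train(labeled):
    """Stand-in learner: count context words per label."""
    model = defaultdict(Counter)
    for context, label in labeled:
        model[label].update(context)
    return model


def classify(model, context):
    """Pick the label whose training contexts overlap most with this context."""
    return max(model, key=lambda label: sum(model[label][w] for w in context))


documents = ["the explosion destroyed the bridge", "the table stood near the bridge"]
seeds = {"explosion": "event", "table": "nonevent"}
model = train(label_with_seeds(build_contexts(documents, window=2), seeds))
print(classify(model, ["destroyed", "a"]))
```

Only the seed words are labeled by hand. Everything else the model knows comes from the contexts found in the repository, which is the reason a small seed file can be enough.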

Page 12

Versatility of learning tool applied to different tasks

• Example: Nominal Event Classifier

– Seed file: 95 unambiguous event nominals, 295 unambiguous non-event nominals

– Repository: News texts processed by Semantex

– Config file (DTL): Look at features surrounding nouns

– Config file (LearningTool): Learn using a mixture model

• Example: Disease Outbreak Classifier

– Seed file: 10 verb types representative of disease outbreak

– Repository: Medical reports processed by Semantex

– Config file (DTL): Look at features surrounding verbs

– Config file (LearningTool): Learn using distributional similarity
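Distributional similarity, as named for the disease outbreak example, can be sketched as a cosine similarity between context-word count vectors. The sentences, the window size and the function names below are made up for illustration. The slides do not say which similarity measure or features the Learning Tool uses, so cosine over neighbouring words is an assumption.

```python
import math
from collections import Counter


def context_vector(word, sentences, window=2):
    """Count the words that occur within `window` tokens of `word`."""
    vec = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok == word:
                vec.update(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


sentences = [
    "cholera infected hundreds of villagers",
    "measles sickened hundreds of children",
    "officials visited the clinic today",
]
seed = context_vector("infected", sentences)  # "infected" plays the role of a seed verb
sim_sickened = cosine(seed, context_vector("sickened", sentences))
sim_visited = cosine(seed, context_vector("visited", sentences))
print(sim_sickened, sim_visited)
```

A verb that shares contexts with a seed verb scores higher than one that does not, so new outbreak verbs can be proposed without labeling them by hand.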

Example: Name Disambiguation

• Are two instances of Tom Smith the same individual?
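As a rough illustration of the question, two mentions of the same name can be compared by how much their surrounding words overlap. The sketch below uses Jaccard overlap of whole-document word sets with an arbitrary threshold. The documents, the function names and the threshold value of 0.3 are hypothetical, and the slides do not describe the method actually used.

```python
def mention_context(document, name):
    """Words of the document other than the name itself, as a crude context profile."""
    name_tokens = set(name.lower().split())
    return {w.strip(".,").lower() for w in document.split()} - name_tokens


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b) if a | b else 0.0


def same_individual(doc_a, doc_b, name, threshold=0.3):
    """Guess that two mentions co-refer when their contexts overlap enough."""
    return jaccard(mention_context(doc_a, name), mention_context(doc_b, name)) >= threshold


doc1 = "Tom Smith, the Buffalo software engineer, spoke at the conference."
doc2 = "Software engineer Tom Smith of Buffalo joined the conference panel."
doc3 = "Tom Smith scored twice for the visiting football team."

print(same_individual(doc1, doc2, "Tom Smith"))
print(same_individual(doc1, doc3, "Tom Smith"))
```

In practice the context profile would more plausibly be built from extracted entities and relations than from raw words, which ties this example to the cross-document entity profile shown earlier.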

Page 13

Conclusions

• Dealing with noisy text is not a futile exercise!

– Commercial applications are already available

– Need to specify analysis requirements clearly

– Adapt IE technology appropriately